Though some progress has been made towards producing natural-sounding speech
using the linear prediction (LP) techniques, an appropriate feature-based parametric
excitation  source  has  not  been  fully  developed  for  this  type  of  synthesizer.  The  intent  of  this 
research  is  to  verify  the  importance  of  selected  acoustic  measures  by  means  of  LP  analysis 
within  the  source-filter  theory,  and  then  use  the  deduced  information  to  develop  an  LP 
synthesizer  that  is  capable  of  synthesizing  high-quality,  natural-sounding  speech. 

In  order  to  carry  out  the  requirements  of  this  research,  we  divide  the  relevant  issues 
into  two  separate  but  related  phases.  In  the  first  phase  we  propose  methods  for  isolating  and 
extracting  the  acoustic  features  of  vocal  quality.  Based  upon  a comprehensive  speech 
production  model,  the  LP  analysis  is  used  to  estimate  spectral  properties  of  the  speech  signals 

retrieved  from  the  integrated  residue.  An  algorithm  is  developed  to  extract  the  time  domain 
characteristics  of  the  vocal  noise.  Various  aspects  of  such  extracted  noise  are  examined 
subsequently.  To  illustrate  the  above-mentioned  analysis  techniques,  the  measured  acoustic 
parameters  of  the  proposed  speech  model  for  three  voice  types  (modal,  vocal  fry,  and 


breathy)  are  provided  as  representative  examples.  It  is  anticipated  that  our  findings  will 
contribute  to  the  understanding  of  the  problems  of  modeling  the  excitation  source  and  the 
LP  synthesizer. 

In  the  second  phase  we  propose  a novel  source  model  to  simulate  the  residue  signal 
in  terms  of  the  glottal  phase  characteristics.  Depending  on  the  voicing  condition  of  the 
analyzed speech, the excitation source is formulated as two separate codebooks, i.e., a glottal
codebook for voiced segments and a stochastic codebook for unvoiced and silence segments.
Methods for determining voicing intervals are presented, along with procedures for
searching the codewords for the appropriate excitation. Since pitch synchronous schemes
are preferred for speech synthesis, we describe procedures for identifying the instants of
glottal closure and for interpolating the excitation pulses as well as the LP coefficients.
Moreover, we account for the effects of vocal noise and source-tract interaction, which are
generally  ignored  in  most  synthesizers.  Finally,  a method  for  determining  the  voicing  gain 
is  given.  This  method  also  serves  as  an  expository  tool  to  explicate  the  relationship  between 
the  gain  and  power  intensity.  Informal  listening  tests  were  used  to  evaluate  the  speech 
processing  techniques.  The  listening  tests  revealed  that  the  quality  of  synthetic  speech  was 
close  to  that  of  the  original  speech.  The  results  indicate  that  our  source  model  is  able  to 
characterize  the  glottal  features  and  that  the  overall  speech  production  model  is  quite 


adequate  for  high-quality  synthesis. 


CHAPTER  1 
INTRODUCTION 


Speech  is  a sophisticated  skill  that  humans  have  developed  for  efficient 
communication.  This  skill  transmits  not  only  linguistic  information  but  also  acoustic 
features  that  convey  the  speaker's  identity  and  other  aspects  of  the  speaker's  physical  and 
emotional  state.  Although  our  current  knowledge  is  insufficient  to  unveil  the  linguistic  rules 
that  describe  the  phonatory  system,  the  mechanism  of  speech  production  is  becoming 
comprehensible  due  to  the  advances  in  acoustic  theory  and  computing  technologies. 
Phonatory  acoustics  forms  the  basis  for  all  present-day  speech  synthesizers.  The  increasing 
use  of  speech  synthesizers  in  the  marketplace  has  produced  great  demand  for  products  that 
can  generate  "high-quality"  speech.  In  fact,  quality  degradation  with  regard  to  existing 
speech  synthesizers  mostly  results  from  unnatural-sounding  characteristics,  which  are 
known to cause perceptual difficulties (Pisoni and Hunnicutt, 1980; Pisoni et al., 1983). The
aim  of  this  dissertation  was  to  seek  methods  that  could  improve  the  naturalness  of  synthetic 
speech.  We  restricted  our  attention  to  the  Linear  Prediction  (LP)  technique  because  of  its 
simplicity  and  accuracy  in  speech  processing. 

Following  a brief  introduction  to  speech  production,  we  give  an  overview  of  some 
existing  synthesis  techniques.  Then  we  provide  the  details  of  speech  modeling  and 
processing.  This  overview,  associated  with  the  basic  knowledge  of  the  phonatory  system, 

1.1  Speech  Production  Mechanism 

Perhaps  the  easiest  way  to  describe  the  speech  production  mechanism  is  to  explain 
the  physiological  function  of  the  anatomy  of  the  human  vocal  system.  In  general,  the  speech 


acoustic  modulation,  which  shapes  the  excitation  spectrum  to  form  intelligible  sounds  (or 
phonemes). 


The  lungs  act  as  an  air  reservoir,  expelling  air  up  the  trachea  to  the  vocal  folds. 
During periods of voiced speech, the vocal folds open and close in a quasi-periodic fashion,
producing  a pulsating  airstream.  During  the  periods  of  unvoiced  speech,  the  vocal  folds  are 
held  apart  so  that  the  airstream  is  less  disturbed  and  can  be  considered  as  a steady  turbulent 
source.  The  vocal  folds  have  an  important  role  in  determining  the  characteristics  of  vocal 
quality.  The  oscillation  of  the  vocal  folds  can  be  described  by  the  aerodynamic-myoelastic 
theory (Berg et al., 1957; Berg, 1958). Basically, the motion of the vocal folds is controlled
by  several  interplaying  forces  that  cause  the  abduction  and  adduction  of  the  folds.  When  the 

is  then  released  through  the  glottis.  The  volume  velocity  of  air  passing  through  the  glottis 
increases  as  the  vocal  folds  keep  opening.  As  the  velocity  increases  beyond  some  threshold, 
pressure  across  the  folds  begins  to  drop  and  then  results  in  a Bernoulli  effect.  This  effect, 

folds  at  the  time  these  two  effects  outweigh  the  subglottal  pressure.  When  the  vocal  folds 
close,  the  subglottal  pressure  builds  up  again  and  the  entire  procedure  repeats.  Such  a 
repetitive  cycle  is  referred  to  as  a pitch  period;  its  reciprocal  is  denoted  as  the  fundamental 
frequency. 

Noise  generated  by  turbulence  is  another  important  source  of  speech  production. 
The  airflow  emerging  from  the  lungs  can  cause  turbulent  streaming  while  passing  through 
a vocal  aperture,  which  is  either  the  vibrating  vocal  folds  or  a constriction  along  the  vocal 
tract.  Such  turbulence  ceases  if  the  vocal  aperture  opens  sufficiently  or  the  airflow  decreases. 
The  possibility  of  turbulent  flow  is  indicated  by  the  value  of  the  Reynolds  number,  which 


characterizes the flow of the airstream as either laminar, turbulent or somewhere in
between (Acheson, 1990). With these kinds of characteristics, there is no doubt that the
turbulence becomes an essential element in fricative, aspirative, plosive, whisper and
breathy  sounds.  This  fact  necessitates  the  use  of  a noise  source  while  synthesizing  those 
particular  sounds. 
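
As a rough numerical illustration of the Reynolds-number criterion mentioned above, the following sketch computes Re = (density x velocity x aperture size) / viscosity for an airstream passing through a vocal aperture. The default air properties, the choice of characteristic dimension, and both regime thresholds are illustrative assumptions rather than values taken from this study.

```python
def reynolds_number(velocity_m_s, aperture_m, density_kg_m3=1.2, viscosity_pa_s=1.8e-5):
    """Return the dimensionless Reynolds number Re = rho * v * d / mu.

    The default density and dynamic viscosity are approximate values for air
    at room temperature (assumed here, not taken from the text).
    """
    return density_kg_m3 * velocity_m_s * aperture_m / viscosity_pa_s


def flow_regime(re, laminar_below=2000.0, turbulent_above=4000.0):
    """Label the flow; both thresholds are hypothetical rule-of-thumb values."""
    if re < laminar_below:
        return "laminar"
    if re > turbulent_above:
        return "turbulent"
    return "transitional"


if __name__ == "__main__":
    # Illustrative particle velocities (m/s) through a 3 mm aperture.
    for v in (5.0, 20.0, 40.0):
        re = reynolds_number(v, aperture_m=0.003)
        print(f"v = {v:5.1f} m/s  Re = {re:8.1f}  {flow_regime(re)}")
```

Widening the aperture or lowering the airflow changes Re in the direction the text describes; the actual critical value for a glottal or supraglottal constriction depends on its geometry.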


The human vocal tract, extending from the glottis to the lips, can be considered as
an acoustic tube of nonuniform shape varying

to this time-varying change include the lips, jaw, tongue, velum and nasal cavity. During
the periods of nonnasal sounds, the velum closes off the nasal tract from the vocal tract.
Thus, the acoustic tube only exhibits poles in its transfer function. When the velum is
lowered, the vocal tract is acoustically coupled with the nasal tract, forming a pole-zero
system. As the tube varies its shape for different sounds, the resultant transfer function is
such that it emphasizes certain frequency components of the glottal wave and/or
de-emphasizes others. The resonant peaks of the speech output due to the poles are referred


1.2 Previous Research on Speech Production

The earliest efforts of speech research were directed to exploring the physiological
nature of the human phonatory system. At that time, the speech synthesizers played a
fundamental role in learning the process of speech production. The talking machine,
designed by von Kempelen in 1791, contained a bellows which supplied air to a reed
(Flanagan, 1972b); the bellows and the reed were obviously used to simulate the lungs and
the vocal folds respectively. A hand-varied resonator was provided to simulate the acoustic
response of the vocal tract. This machine was reported to produce only a few vowels.
Modern speech synthesizers are electrical in nature. Technologies developed over this

sophisticated techniques which greatly improved the quality of
synthetic speech. Such a technological evolution, accompanied by the emerging
understanding  of  speech  acoustics,  gradually  shifted  the  focuses  and  interests  of  speech 
synthesis  to  other  applications.  The  most  significant  influence  was  the  Vocoder  invented  by 
Dudley (Dudley, 1939), whose efforts spawned a subfield of communication engineering.
Research in this subfield was aimed at the efficient encoding and transmission of speech
information. The techniques of interest were directed toward obtaining acceptable quality
at low bit rates, using reasonable computational resources in a real-time environment.
Research  issues  encompassed  methods  to  improve  quality,  robustness,  delay  and 

As  speech  synthesis  techniques  have  continued  to  improve  in  recent  years,  many 
speech  synthesizers  have  been  employed  to  implement  voice  response  systems  for 
computers, which are called "text-to-speech" techniques. Speech synthesis in the sense
of "text-to-speech" means automatically producing voice response according to a text input.

the  visually  impaired. 


1.3  Models  for  Speech  Synthesis 

This  research  is  directed  toward  improving  the  speech  production  model.  We  define 
the  term  "speech  analysis"  as  the  procedure  used  to  extract  the  speech  production  model 
parameters from the speech signal and "speech synthesis" as the procedure used to reproduce
the  acoustic  speech  signal  by  controlling  and  updating  the  appropriate  parameters  obtained 
from  the  speech  analysis. 

Modern speech synthesizers can be classified into two groups, one based on the
Fourier transform methods and the other based on the source-filter model.


1.3.1 Fourier Model


The  Fourier  transform  has  traditionally  been  used  to  study  speech  signals  because  it 
provides  a frequency  domain  analysis  of  the  phonatory  and  auditory  properties  of  speech 
signals.  Using  the  Fourier  model,  the  speech  signal  is  analyzed  using  short-time  Fourier 
analysis  (STFA),  while  synthesis  is  carried  out  by  an  inverse  transform  (Allen,  1977;  Allen 
and  Rabiner,  1977).  The  term  "short-time"  implies  that  the  speech  spectrum  is  stationary 
over  a short  interval  of  time.  This  is  a valid  approach  to  speech  processing  because  many 
psychoacoustic  and  physiological  studies  have  shown  that  the  human  ear  performs  a type  of 

The  channel  vocoder  is  the  oldest  form  of  speech  coding  device  that  exploits  Fourier 
analysis and synthesis (Dudley, 1939). This vocoder consists of several bandpass
filters,  each  of  which  is  employed  to  preserve  the  magnitude  Fourier  transform  of  the  speech 
signal  within  a specific  band.  An  additional  channel  is  needed  to  transmit  other  information 
regarding  the  excitation,  e.g.,  the  voiced/unvoiced  signal  and  the  pitch  period  for  voiced 
speech.  Consequently,  the  concept  of  source  excitation  was  incorporated  into  the 
configuration  of  the  channel  vocoder. 

Another  Fourier-based  model  that  has  experienced  popularity  is  the  phase  vocoder 
(Flanagan  and  Golden,  1966).  The  major  success  of  this  technique  originates  from  a polar 
representation of the Fourier transformation, i.e., phase and amplitude, which leads to an
economy of transmission bandwidth. Unlike the channel vocoder, which neglects the phase
spectrum, the phase vocoder exploits the phase information through the derivative of the
phase spectrum. Furthermore, it provides flexibility for expanding and compressing the time
scale through the manipulation of the instantaneous frequency. Emerging from a similar
idea, a new class of models called “sinusoidal coders” was developed and has proliferated
since the early 1980s (Hedelin, 1981; Almeida and Tribolet, 1982; Almeida and Silva, 1984;
McAulay and Quatieri, 1984; Trancoso et al., 1990). For such coders, the speech signal


within each frame is represented by a superposition of sinusoids with time-varying
amplitudes and phases:

$$s(t) = \sum_{i=1}^{n} a_i(t) \cos \phi_i(t), \qquad 0 \le t \le T \tag{1-1}$$

where $n$ is the number of sinusoids, $a_i(t)$ is the amplitude of the $i$th sinusoid, $\phi_i(t)$ is the
corresponding phase and $T$ is the frame length. The variation of amplitudes, $a_i$'s, and
phases, $\phi_i$'s, within a short interval is usually described by first- and third-order polynomials

$$a_i(t) = A_i + (t/T) B_i \tag{1-2}$$

$$\phi_i(t) = c_{i1} t^3 + c_{i2} t^2 + c_{i3} t + c_{i4}. \tag{1-3}$$

These  polynomials  are  then  applied  to  an  interpolation  rule  for  the  instantaneous  values  of 
amplitude  and  phase  as  well  as  frequency.  With  a 10  ms  frame,  speech  quality  obtained  using 
this  model  is  virtually  indistinguishable  from  the  original. 
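
The frame-level synthesis described by Eqs. (1-1) through (1-3) can be sketched as follows. This is a minimal illustration, not the coders cited above: the function name, the tuple layout, and the ordering of the four phase coefficients (highest power first) are assumptions made for the example.

```python
import math


def synthesize_frame(components, frame_len_s, sample_rate_hz):
    """Evaluate s(t) = sum_i a_i(t) * cos(phi_i(t)) over one frame.

    components: list of (A, B, (c1, c2, c3, c4)) tuples, one per sinusoid, with
        a_i(t)   = A + (t / T) * B                 (first-order amplitude)
        phi_i(t) = c1*t**3 + c2*t**2 + c3*t + c4   (third-order phase)
    The coefficient ordering is an assumption made for this sketch.
    """
    n_samples = int(round(frame_len_s * sample_rate_hz))
    frame = []
    for m in range(n_samples):
        t = m / sample_rate_hz
        value = 0.0
        for amp_start, amp_change, (c1, c2, c3, c4) in components:
            amplitude = amp_start + (t / frame_len_s) * amp_change
            phase = ((c1 * t + c2) * t + c3) * t + c4  # Horner evaluation of the cubic
            value += amplitude * math.cos(phase)
        frame.append(value)
    return frame


if __name__ == "__main__":
    # Two harmonics of a 100 Hz fundamental with constant frequency (c1 = c2 = 0).
    parts = [
        (1.0, -0.2, (0.0, 0.0, 2 * math.pi * 100.0, 0.0)),
        (0.5, 0.1, (0.0, 0.0, 2 * math.pi * 200.0, 0.0)),
    ]
    frame = synthesize_frame(parts, frame_len_s=0.010, sample_rate_hz=8000)
    print(len(frame), frame[:4])
```

In a complete coder the polynomial coefficients would be chosen so that amplitude, phase and frequency are continuous across frame boundaries; that matching step is omitted here.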

Among the sinusoidal coders, approaches for processing the Fourier-based
parameters can be divided into two classes. Members in one class separate the pitch
harmonics from the spectral envelope, and only apply the sinusoidal processing techniques
to the harmonics. In other words, the $a_i$'s in Eq. (1-1) are obtained by other means such as
linear prediction and cepstrum analysis. Members in the other class, on the other hand,
consider the $a_i$'s as part of the results of Fourier analysis. Interestingly, the generation of
noise excitation also exhibits two different forms, i.e., either white noise or a signal with
random phases but constant amplitudes.

1.3.2 Source-filter Model

The source-filter model was developed by Fant in the late 1950s (Fant, 1959; Fant,
1960).  In  this  model  the  speech  signal  is  modeled  as  the  filtered  output  of  a network  excited 


by  quasi-periodic  pulses  for  voiced  speech  or  by  random  noise  for  unvoiced  speech.  The 
transfer function of the network is defined as the ratio of the Laplace transform of the sound
pressure  from  the  lips  of  the  speaker  to  the  volume  velocity  of  the  airflow  passing  the  vocal 
folds. In the sense of speech production, a speech signal inherits both characteristics of the
source and the network.

Based on the source-filter theory, speech synthesizers can
be further classified into three categories, namely, the LP, formant and articulatory
synthesizers.

The articulatory synthesizer is a direct approach that simulates speech production
and propagation from the viewpoint of anatomy and physiology. In order to describe the
wave propagation by means of aerodynamic equations, we must specify such parameters as
subglottal pressure, elasticity of the vocal folds and viscosity of the vocal tract, in addition

down into a sequence of subsegments of constant cross-sectional areas, the complexity

equivalent circuit such as an analog transmission line (Figure 1-2). Furthermore, the control
system for driving the area function of the vocal tract was developed by matching the

of the model to those of real speech.


Figure 1-1. Articulatory model of human vocal tract and the associated
control variables.


Figure 1-2. Equivalent circuit for the vocal tract.


The development of the formant synthesizer is mainly based on the perceptual
characteristics of the human auditory apparatus. In this type of synthesizer, the transfer
function is directly controlled by the use of resonant and anti-resonant filters, whose center
frequencies  and  bandwidths  can  be  individually  specified.  The  resonant  filters  can  be 
connected  either  in  parallel  or  in  series  so  as  to  facilitate  the  production  of  both  nasal  and 
nonnasal  sounds.  An  excitation  model  resembling  the  natural  excitation  is  used  to  provide 


The LP synthesizer consists of an excitation source and a time-varying all-pole filter
(Figure  1-4).  The  all-pole  filter  determines  the  spectral  envelope  of  the  synthesized  speech, 
and  the  excitation  source  provides  the  line  structure  of  the  spectrum  harmonics.  The  all-pole 

autoregressive process, that is, the current sample is a linearly weighted sum of previous

spectrum  of  speech  signals.  Since  the  human  ear  is  mostly  sensitive  to  the  magnitude 
spectrum  of  an  acoustic  signal,  the  ability  of  preserving  the  spectral  envelope  by  the  LP 
analysis  is  the  main  reason  for  its  success. 
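
The analysis and synthesis roles of the all-pole filter can be illustrated with a short sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion and the sign convention s(n) = sum_k a_k s(n-k) + e(n); the function names are hypothetical, and the procedures actually used in this study (for example pitch-synchronous framing and gain handling) are not reproduced here.

```python
def autocorrelation(x, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations for predictor coefficients a[1..order].

    Convention (assumed): s(n) is predicted as sum_k a_k * s(n - k).
    Returns (coefficients, prediction_error_energy). Requires r[0] > 0.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


def inverse_filter(s, coeffs):
    """Residue e(n) = s(n) - sum_k a_k * s(n - k); samples before n = 0 are zero."""
    return [s[n] - sum(a * s[n - k] for k, a in enumerate(coeffs, start=1) if n - k >= 0)
            for n in range(len(s))]


def all_pole_synthesis(e, coeffs):
    """Rebuild s(n) = e(n) + sum_k a_k * s(n - k) from an excitation e(n)."""
    s = []
    for n in range(len(e)):
        s.append(e[n] + sum(a * s[n - k] for k, a in enumerate(coeffs, start=1) if n - k >= 0))
    return s


if __name__ == "__main__":
    signal = [0.9 ** n for n in range(200)]          # impulse response of a one-pole filter
    coeffs, energy = levinson_durbin(autocorrelation(signal, 1), 1)
    residue = inverse_filter(signal, coeffs)
    rebuilt = all_pole_synthesis(residue, coeffs)
    print(coeffs, max(abs(u - v) for u, v in zip(signal, rebuilt)))
```

Driving the synthesis filter with the exact residue reproduces the input; a practical synthesizer replaces the residue with a modeled excitation, which is where the source model matters.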

The  representation  for  the  spectral  envelope  based  on  LP  analysis  also  has  many 
implications for other types of vocoders. For instance, the cepstrum, which is obtained from
a homomorphic system (Figure 1-5), is an alternative form that manifests the short-time
speech spectrum (Oppenheim, 1969). The impulse response, V, computed from the
cepstrum, can be considered as the coefficient sequence of an FIR filter exhibiting a spectral
envelope similar to that of the all-pole filter:

$$H(z) = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}} \tag{1-4}$$

where $a_i$ is the $i$th LP coefficient, $p$ is the order of the predictor, and $G$ is the gain. The filtering operation applied to the source


Figure  1-3.  Block  diagram  of  formant  synthesizer. 


Figure  1-4.  Block  diagram  of  LP  synthesizer. 
Figure 1-5. Block diagram of homomorphic system: (a) analysis process,
(b) synthesis process.


1.3.2.4 Comment on the three types of synthesizers


The  advantages  and  disadvantages  of  the  three  types  of  synthesizers  are  given  in 
Table 1-1. For some LP synthesizers, the deficiency of the all-pole filter is ameliorated by
employing a pole-zero model (Atal and Schroeder, 1978; Childers et al., 1981). Also, the
excitation function may be replaced by sophisticated pulses or innovative ensembles that
simulate the residue signal. (This is discussed in detail in Section 3.1.) Moreover, an
independent control of the spectral characteristics is achieved by factoring the filter into
resonators and anti-resonators (Kuwabara, 1984; Childers et al., 1989b). For some formant
synthesizers, the dynamics of spectral characteristics are enhanced simply by inserting
features into the glottal source (Fujisaki and Ljungqvist, 1986). Likewise, the effect of
source-tract interaction may be simulated by either modifying the glottal waveshape or the
formant bandwidths or by incorporating a control circuit (Guérin et al., 1976; Yea et al.,
1983; Fujisaki and Ljungqvist, 1986; Wong, 1991).

Since each of the above-mentioned schemes increases the computational burden of
the processing task, the complexity is no longer a major drawback only for the articulatory
synthesizer. Besides, in many articulatory synthesizers the movements of the articulators are
determined by comparing the formants of the synthetic speech with those of the original
speech (Parthasarathy and Coker, 1992; Prado et al., 1992). Consequently, one type of
source-filter  synthesizer  is  not  particularly  different  from  the  others. 

1.4 Research Issues and Objectives

Ideally,  a speech  synthesizer  should  have  the  ability  to  produce  any  desired  voice 
quality. From this standpoint, the attributes of voice quality are directly related to the control
parameters of a synthesizer. In other words, these control parameters may verify the quality
attributes that the human ear uses to discriminate voice types. This use of a speech
production model for speech research is called "analysis-by-synthesis."


Our  ultimate  objective  in  this  dissertation  was  to  develop  a high-quality  speech 
synthesizer. The quality of speech, in general, is referred to as the total auditory impression
the  listener  experiences  upon  hearing  the  speech  of  a speaker.  It  consists  of  two  factors, 
namely,  naturalness  and  intelligibility.  Through  the  progressive  understanding  of  speech 

speech  signals  depends  largely  on  the  vocal  tract,  while  the  source  characteristics  determine 
the naturalness of the voice. The intelligibility is not our concern here since most present-day
synthesizers are capable of conveying the intended speech content correctly. Instead, we are
more interested in the vocal source because of its contributions to the naturalness of speech.
For this reason, the words "quality" and "naturalness" will be considered equivalent in this
dissertation.

separate  but  related  phases.  In  the  first  phase  we  discussed  how  to  obtain  acoustic  measures 
by  LP  techniques.  Three  types  of  voiced  speech  (modal,  vocal  fry,  breathy)  were  used  as 
representative  examples  to  illustrate  the  source  properties. 

We  have  selected  the  LP  technique  to  accomplish  this  study  despite  some  arguments 
against  the  use  of  such  a technique.  The  LP  analysis,  in  our  opinion,  is  more  than  adequate 
because  source  properties  are  all  extractable  from  the  residue  signal  obtained  by  inverse 
filtering  of  the  speech  signal.  This  argument  becomes  clear  in  Chapter  2 when  we  discuss 
the  relationship  between  the  residue  and  the  volume-velocity  flow.  We  identify  the 
significance  of  acoustic  measures  extracted  from  both  the  residue  and  speech  signals  and 
subsequently  correlate  these  measures  to  the  control  parameters  of  a speech  production 

The  knowledge  gained  in  the  first  phase  of  the  research  is  useful  for  the  design  of  a 
source  model  for  an  LP  synthesizer.  It  has  long  been  known  that  the  lack  of  glottal 
characteristics  is  the  primary  reason  leading  to  the  poor  quality  of  LP  synthesizers.  In  the 


signal. Such a source model will be presented in the form of a codebook that will be
incorporated into a newly designed speech production model. In addition to source
modeling, other factors, such as the interpolation of LP coefficients, turbulent noise,


methods  and  strategies  to  deal  with  these  factors. 

The  efficacy  of  the  source  and  speech  production  models  is  determined  by  evaluating 
the  quality  of  synthetic  speech.  We  have  taken  the  analysis-by-synthesis  approach  in 
studying  speech  quality.  This  approach  provides  information  about  whether  the  important 
acoustic  features  are  successfully  maintained  during  the  modeling  process.  While  no  reliable 
quantitative  measure  is  available  for  performing  speech  evaluation,  informal  subjective 
listening  tests  were  conducted  to  assess  the  quality  of  the  synthetic  speech  samples.  The 
overall research plan is presented as a schematic diagram in Figure 1-6.


Chapter  2 describes  the  procedures  for  measuring  vocal  source  properties  by  linear 
predictive  analysis.  Following  a retrospect  of  some  existing  acoustic  measures,  our  first 
focus  is  on  sorting  the  relationships  between  these  measures  and  the  control  parameters  of 
a comprehensive  speech  production  model.  In  particular,  under  the  guidance  of  this  model, 
we  propose  methods  for  identifying  and  isolating  the  acoustic  characteristics  of  vocal 
quality.  Three  voice  types  are  provided  as  representative  examples  to  illustrate  the  proposed 
model  and  analysis  techniques.  Knowledge  gained  in  this  chapter  contributes  to  the 
understanding  of  general  problems  of  source  modeling  and  speech  processing,  which  we 
present  in  Chapters  3 and  4. 

Chapter  3 deals  with  the  modeling  of  the  excitation  source.  Depending  on  the  voicing 


Thus. 


i of  the  glottal  phase 


Figure  1-6.  Schematic  diagram  of  research  plan. 


Both types of excitations are formulated into codebooks. Our methods of generating the
codebooks are associated with these two types of excitations individually.

The linear predictive analysis and synthesis schemes used in this study constitute
the first two parts of Chapter 4. Issues such as the voicing decision, Glottal Closure Instant

determination are addressed. The overall performance of these schemes is dependent on how
closely  the  reproduced  speech  resembles  the  original.  While  no  reliable  objective  quality 
measure  is  currently  available,  we  evaluate  the  synthetic  speech  by  informal  listening  tests. 

Chapter 5, the last chapter, summarizes the results of this study, discusses possible
improvements  to  the  proposed  model  and  finally  recommends  some  potential  applications. 


CHAPTER  2 
SOURCE  PROPERTIES 


A better  understanding  of  speech  production  is  important  for  the  assessment  of 
speech  quality  as  well  as  for  the  development  of  a natural-sounding  speech  synthesis  model. 
In this chapter we are particularly interested in the glottal source properties that affect the
perceptual  quality  of  the  voice.  The  elucidation  of  the  relationship  between  the  excitation 
source  and  the  resultant  speech  quality  requires  source-related  parameters  to  describe 
acoustic  and  perceptual  features,  as  well  as  methods  to  extract  the  parameters.  The 
analysis-by-synthesis  technique  is  a general  approach  to  speech  analysis  (Rabiner  and 
Schafer,  1978;  Furui,  1985).  In  principle,  we  establish  the  speech  production  model  and 
then  derive  the  model  parameters  used  to  reproduce  speech  signals.  Speech  synthesis,  in 
conjunction  with  perceptual  evaluation,  plays  a role  in  validating  the  significance  of  the 
acoustic  features  in  terms  of  the  model  parameters.  As  the  speech  production  models 
become  more  and  more  sophisticated,  many  detailed  acoustic-perceptual  correlations  will 
be  easily  verified  by  the  analysis-by-synthesis  approach. 

model  parameters  and  acoustic  features  measured  from  the  speech  signal.  Following  a brief 

techniques,  we  discuss  the  relationship  between  two  commonly  encountered  source 
excitation  signals,  namely,  the  residue  signal  and  the  differentiated  glottal  flow  waveform. 
In  order  to  facilitate  the  acquisition  of  the  source  excitation,  we  have  used  the  LP  technique 
as  a vehicle  to  complete  this  research.  A new  LP  synthesis  model  with  appropriate  source 
features is then proposed. Nine utterances of three types of phonations, i.e., modal, vocal


proposed  model. 


Basically, researchers have used five types of acoustic measures to study vocal
quality:

(1) Perturbation measures,

(2) Characteristics of the glottal flow waveform,

(3)  Vocal  noise, 

(4)  Roots  of  the  inverse  vocal  tract  filter, 

(5)  Vocal  intensity. 


Voiced  speech  is  generated  by  the  vibration  of  vocal  folds.  Aberrant  vibratory 
patterns of vocal folds have long been known to result in abnormal or deviant voices (Moore,
1976). Statistical properties of the cycle-to-cycle variations in voiced speech have proven
useful to characterize vocal quality (Askenfelt and Hammarberg, 1986; Schoentgen, 1989;
Pinto and Titze, 1990; Eskenazi et al., 1990). The perturbations in the fundamental frequency
and amplitude of sustained utterances, termed jitter (Lieberman, 1961) and shimmer (Koike,
1969),  respectively,  were  two  of  the  first  acoustic  measures  reported  to  be  correlated  with 
vocal  pathology.  Since  then,  other  perturbation  measures  have  also  been  shown  to  be  capable 
of  distinguishing  pathological  from  normal  voices. 
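
To make the two perturbation measures concrete, the sketch below computes a relative jitter and shimmer from per-cycle measurements. It uses one common definition (mean absolute cycle-to-cycle difference divided by the mean value, in percent), which is an assumption for illustration; the cited studies use several different variants, and the function names are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive cycles, relative to the mean, in percent."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


def jitter_percent(periods_s):
    """Cycle-to-cycle perturbation of the pitch period (seconds per cycle)."""
    return relative_perturbation(periods_s)


def shimmer_percent(peak_amplitudes):
    """Cycle-to-cycle perturbation of the per-cycle peak amplitude."""
    return relative_perturbation(peak_amplitudes)


if __name__ == "__main__":
    # Hypothetical measurements from a sustained vowel (not data from this study).
    periods = [0.0100, 0.0101, 0.0099, 0.0100, 0.0102]
    peaks = [0.80, 0.78, 0.81, 0.79, 0.80]
    print(f"jitter  = {jitter_percent(periods):.2f} %")
    print(f"shimmer = {shimmer_percent(peaks):.2f} %")
```

Both measures presuppose that the pitch periods have already been marked, so their reliability depends on the accuracy of the cycle boundaries.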


2.1.2 Characteristics of the glottal flow waveform

The characteristics of the glottal flow considered for the assessment of speech quality
can be further classified into two categories: (1) qualitative analysis based on parameters of
source models, and (2) spectral tilt.


2.1.2.1 Qualitative analysis based on parameters of source models

Monitoring the glottal flow waveform is a direct means of studying the variations of
the glottal source (Hillman and Weinberg, 1981; Javkin et al., 1987; Price, 1989). In order
to assess such variations on a quantitative basis, a parametric model was often introduced.
One such model that has been widely adopted for quality assessment in recent years is the
LF model (Fant et al., 1985; Fujisaki and Ljungqvist, 1986; Fant and Lin, 1988; Gobl, 1988
& 1989; Karlsson, 1988; Ahn, 1991; Tenpaku and Hirahara, 1990; Childers and Lee, 1991).
This model is useful because it ensures an overall fit to commonly encountered differential
glottal pulses with a minimum number of parameters, and it is flexible in the extent to which

2.1.2.2 Spectral tilt

In addition to the parametric variations of the source models, the spectral tilt of the
glottal flow appears to be characteristic of different voice types (Hollien, 1974; Hiki et al.,
1976; Monsen and Engebretson, 1977). In fact, the steepness of the spectral tilt is caused
by the rapidity of the closing phase and by the abruptness of the glottal closure. The
perceived quality of speech is related to the spectral tilt (Childers and Lee, 1991). A steeply
declining spectral tilt results in a lax quality, whereas a gradually declining tilt produces a
tense quality. To achieve a quantitative measure of this aspect of vocal quality, the spectrum
of the glottal flow is usually approximated by a three-pole model, or equivalently a two-pole
model  for  the  differentiated  glottal  flow.  The  coefficients  of  the  three-  or  two-pole  models 


Turbulence at the level of the glottis also contributes to vocal qualities such as hoarseness
and breathiness, which are prominent symptoms of laryngeal pathologies (Klatt, 1987; Klatt
and Klatt, 1990; Childers and Lee, 1991). Methods for measuring the turbulent noise consist
of the relative intensity (Hiraoka et al., 1984; Fukazawa et al., 1988), the spectral noise level
and the harmonic-to-noise ratio (Kitajima, 1981; Yumoto et al., 1982; Yumoto et al., 1984;
Kasuya et al., 1986a&b; Muta et al., 1987; Childers and Lee, 1991). In most cases, these
noise measures were influenced by the spectral content of the analyzed speech.
Consequently, better methods are needed so that the glottal flow waveform can be analyzed
more precisely.


Another  aspect  of  speech  spectra  that  affects  the  detection  of  laryngeal  dysfunction 
has  been  demonstrated  by  Deller  and  Anderson  (1980),  who  represented  the  speech  signal 
by  the  roots  of  the  inverse  filter  and  then  applied  pattern  recognition  techniques  to 
dichotomize  the  subjects  as  either  normal  or  pathological.  It  was  found  that  the 
discrimination  function  employed  in  detecting  laryngeal  behavior  was  more  sensitive  to  the 
poles  attributable  to  the  glottal  source  than  to  the  formant  structure  (Deller,  1982).  This 
technique  was  later  applied  to  the  EGG  signal  by  Smith  and  Childers  (1983)  as  a method  for 
detecting  laryngeal  pathology.  They  concluded  that  the  LP  features  of  EGG  signals  were 
more  sensitive  to  pathology  detection  than  similar  parameters  measured  from  speech  signals. 
Recently,  the  same  task  was  recast  on  the  pattern  analysis  of  LP  coefficients  by  vector 
quantization (Childers and Bae, 1992). Inferences based upon their results were consistent
with  previous  research. 


Vocal  intensity  is  less  specific  in  quality  assessment  (Colton,  1973;  Hollien,  1974). 
It  is  largely  irrelevant  to  the  perceived  quality  except  for  loudness.  Since  the  selected  speech 
samples  we  used  were  approximately  at  the  same  power  level  after  digitization,  vocal 
intensity  was  not  considered  an  important  factor  in  our  research. 


2.1.6  Remark 


predictor that matches well with the objective evaluation (Hiki et al., 1976; Wolfe and
Steinfatt, 1987; Eskenazi et al., 1990; Pinto and Titze, 1990). In addition, measures of higher

measures  in  quality  assessment  causes  difficulties  in  justifying  the  significance  of  each 
individual measure and their correlations. Pinto and Titze (1990) made an attempt to unify
existing  jitter,  shimmer  and  noise  measures;  however,  no  effort  was  made  to  sort  out  the 
relation  between  the  acoustic  measures  and  the  control  parameters  of  a specific  speech 
production  model.  This  motivated  us  to  explore  those  relations. 

2.2  Glottal Inverse Filtering

Glottal inverse filtering is a popular and efficient means for investigating the
activities  of  the  glottal  source.  It  is  based  on  the  assumptions  that  the  source  excitation  and 
the  supraglottal  loading  are  separable  and  that  the  source  properties  of  the  speech  production 
model  can  be  uniquely  determined.  The  principle  of  inverse  filtering  is  to  obtain  the  glottal 
flow  by  eliminating  the  effects  of  vocal  tract  transfer  function  and  lip  radiation  from  the 
speech signal. Figure 2-1 presents the conceptual inverse filtering model. Notice that in this
representation the order of the vocal tract transfer function and lip radiation is reversed,
which is permissible because speech production is assumed to be a linear model.

Current methods for glottal inverse filtering center on LP analysis (Berouti, 1976;
Wong et al., 1979; Matausek and Batalov, 1980; Childers and Larar, 1984; Krishnamurthy
and Childers, 1986; Milenkovic, 1986; Childers and Lee, 1991). Among the various
methods, the closed-phase covariance analysis is considered the most reliable because no
source-tract interaction is involved. However, the disadvantages of this method are: (1) it
needs to locate the closed phase very accurately, and (2) it is only feasible when the closed
phase is long enough to accommodate the analysis window.
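
As a rough illustration of covariance-method LP analysis restricted to a chosen interval, the following Python sketch fits the predictor by least squares and then applies the resulting inverse filter. The function names, the NumPy least-squares solver, and the way the analysis interval is passed in (`start`, `length`) are assumptions made for illustration only; locating the closed phase itself is not shown.

```python
import numpy as np

def covariance_lp(s, start, length, order):
    """Covariance-method LP over s[start : start + length].

    Requires start >= order so that all past samples exist.
    Returns inverse-filter coefficients [1, a1, ..., a_order].
    """
    n = np.arange(start, start + length)
    # Each row holds the `order` past samples s[n-1], ..., s[n-order].
    past = np.column_stack([s[n - k] for k in range(1, order + 1)])
    # Least-squares predictor: s[n] ~ sum_k c[k] * s[n-k]
    c, *_ = np.linalg.lstsq(past, s[n], rcond=None)
    return np.concatenate(([1.0], -c))

def inverse_filter(s, a):
    """Apply the FIR inverse filter A(z) to s (same length as s)."""
    return np.convolve(s, a)[: len(s)]
```

Because the window contains no excitation in an ideal closed phase, the least-squares fit can recover the all-pole coefficients from very few samples, which is why the method depends so strongly on where the window is placed.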


G(z) : glottal shaping filter
V(z) : vocal tract transfer function
R(z) : lip radiation

Figure 2-1.  Block diagram of glottal in


Recently,  in  order  to  alleviate  these  disadvantages,  adaptive  approaches  have  been 
used  to  track  the  rapid  change  of  the  parameters  of  the  vocal  tract  during  the  glottal  closed 
phase  (Ting  and  Childers,  1990).  In  fact,  it  is  more  convenient  to  estimate  the  composite 
effect  of  the  glottal  pulse,  lip  radiation  and  vocal  tract  together.  The  vocal  tract  transfer 
function  could  be  obtained  by  removing  the  source-related  roots  from  the  LP  polynomial 
(Childers and Lee, 1991). However, this approach may introduce errors due to the incorrect
elimination  or  merging  of  such  roots.  Furthermore,  since  the  estimate  of  the  vocal  tract 
parameters  is  based  on  an  entire  pitch  period,  the  effects  of  different  damping  factors  caused 
by  the  open  and  closed  glottal  intervals  during  a pitch  period  affect  the  estimate. 
Consequently,  the  estimated  glottal  flow  waveform  becomes  an  "average"  waveform  for  the 
entire  pitch  period.  This  average  waveform  may  not  be  truly  representative  of  the  actual 

From the discussion above, we know that the glottal flow waveform is not always
obtainable using the glottal inverse filtering techniques. However, the estimation of the residue
signal is seldom affected by the preceding factors. Moreover, as will be seen in the next
section, the glottal phase characteristics can be retrieved from the residue
signal. For these two reasons, we concentrate our study on the residue signal. The potential
of the residue can be seen from its appearance. It has been observed that the residue extracted
from normal voices consists of periodic sharp spikes and low-level noise components,
whereas the residue extracted from deviant voices exhibits a less distinctive pattern of
periodic spikes (Figure 2-2). Because this difference is less noticeable in the
speech signal, many researchers advocated the use of the residue signal over the speech
signal for the analysis of abnormal voices (Koike and Markel, 1975; Sorensen and Horii,
1984; Prosek et al., 1987). Ironically, the quantitative measures deduced from the residue
signal failed to support their claims (Schoentgen, 1982). We believe this contradiction is due
to the inadequacy of the acoustic measures and the analysis methods. It was noted that the
LP coefficients calculated by a fixed-frame autocorrelation method, which was used by


Figure 2-2.  Speech and residue waveforms for two normal subjects (a) and (b), and for
two pathological subjects (c) and (d). The pathological symptom is hoarseness
for subject (c) and is bilateral paralysis of TVC for subject (d).


Schoentgen (1982), were affected by the size and position of the analyzed frame. Any small
deviation of the estimated coefficients could result in a great change of the residue signal
(Ananthapadmanabha and Yegnanarayana, 1979). Consequently, the acoustic measures
derived from the fixed-frame autocorrelation method are prone to error. To avoid this
problem, a pitch-synchronous covariance analysis method has been used (Chandra and Lin,
1974).

2.3  Correlation between Residue and Differentiated Glottal Flow

It  is  constructive  for  us  to  clarify  the  relation  between  the  residue  and  the  glottal  flow 
before  we  explore  the  characteristics  of  the  glottal  source.  As  shown  in  Figure  2-1,  the 
inverse filtering can be viewed as a process of unscrambling the speech signal so as to obtain
the  excitation  waveform.  One  of  the  intermediate  products  is  the  differentiated  glottal  flow, 
while  for  our  purpose  the  residue  signal  is  the  ultimate  result.  Thus,  the  correspondence 
between  the  residue  and  glottal  flow  can  easily  be  illustrated  as  a filtering  process.  Here  we 
adopt  a two-pole  filter  to  model  the  spectrum  of  the  differentiated  glottal  flow.  The  filter 
coefficients  are  obtained  by  an  LP  analysis  of  the  modeled  LF  waveform. 
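
The two-pole fit described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the autocorrelation form of LP analysis is used here for brevity, the function names are hypothetical, and the conversion from a root to a center frequency and bandwidth uses the standard z-plane relations (angle for frequency, log-magnitude for bandwidth), which may differ in notation from the expressions used later in this chapter.

```python
import numpy as np

def two_pole_lp(x):
    """Order-2 autocorrelation LP; returns inverse-filter coefficients [1, a1, a2]."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation at lags 0, 1, 2
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(3)])
    R = np.array([[r[0], r[1]], [r[1], r[0]]])
    c = np.linalg.solve(R, r[1:])          # predictor coefficients
    return np.concatenate(([1.0], -c))

def root_to_freq_bw(z, fs):
    """Map one root of the inverse filter to (center frequency, bandwidth) in Hz."""
    freq = abs(np.angle(z)) * fs / (2.0 * np.pi)
    bw = -np.log(abs(z)) * fs / np.pi
    return freq, bw
```

Applying `np.roots` to the returned coefficients gives either two real roots or a complex conjugate pair; in the latter case both roots map to the same frequency and bandwidth.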

Since the LF model has been successfully used to describe the characteristics of a
differentiated glottal flow, we adopt it as an explanatory medium for the subsequent discussion.
The equations of the LF model are given as


where $t_p$, $t_e$, $t_c$ are parameters related to the glottal flow peak, maximum closing rate and
glottal closure, respectively. The parameter $t_a$ is used to control the abruptness of the return
phase, and the parameter $\omega_g$, defined as $\pi/t_p$, determines the frequency of the sinusoid.
Parameters $E_0$, $\alpha$ and $\varepsilon$ are for computational use only. A typical LF-model waveform is
shown in Figure 2-3.
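
For readers who want to experiment with the waveform, the sketch below generates one period using the LF model in its commonly published form (Fant et al., 1985): an exponentially growing sinusoid up to $t_e$ followed by an exponential return phase up to $t_c$. The symbol `Ee` (magnitude of the negative peak at $t_e$), the function name, the bisection bracket, and the zero-net-area condition used to fix $\alpha$ are assumptions of this sketch and may not match the notation of the equations in this chapter. It also assumes $t_a$ is much shorter than $t_c - t_e$.

```python
import numpy as np

def lf_waveform(tp, te, ta, tc, Ee=1.0, fs=10000.0):
    """One period of an LF-style differentiated glottal flow (illustrative sketch)."""
    wg = np.pi / tp                       # frequency of the sinusoid
    D = tc - te
    # epsilon solves eps * ta = 1 - exp(-eps * D); fixed-point iteration
    eps = 1.0 / ta
    for _ in range(100):
        eps = (1.0 - np.exp(-eps * D)) / ta
    # Closed-form area of the return phase
    ret_area = -(Ee / (eps * ta)) * ((1.0 - np.exp(-eps * D)) / eps - D * np.exp(-eps * D))
    s_e, c_e = np.sin(wg * te), np.cos(wg * te)

    def net_area(alpha):
        # Closed-form area of the first segment with E(te) fixed at -Ee
        open_area = -(Ee / s_e) * (alpha * s_e - wg * c_e + wg * np.exp(-alpha * te)) \
                    / (alpha ** 2 + wg ** 2)
        return open_area + ret_area

    # Bisection on alpha so that the net area over the period is zero
    lo, hi = -20.0 / te, 20.0 / te
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if net_area(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    E0 = -Ee / (np.exp(alpha * te) * s_e)

    t = np.arange(0.0, tc, 1.0 / fs)
    first = E0 * np.exp(alpha * t) * np.sin(wg * t)
    second = -(Ee / (eps * ta)) * (np.exp(-eps * (t - te)) - np.exp(-eps * D))
    return t, np.where(t <= te, first, second), alpha, eps, E0
```

For example, `lf_waveform(0.0032, 0.0044, 0.00024, 0.008)` produces an 8 ms period whose negative peak occurs at $t_e$ = 4.4 ms; these timing values are arbitrary and are not taken from the data in this study.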


Figure 2-3.  LF-model waveform, E(t), for differentiated glottal flow.


The first segment of the LF model characterizes the differentiated glottal flow over
the interval from the glottal opening to the maximum negative excursion of the waveform.
The second segment represents a residual glottal flow that comes after the maximum
negative excursion. It can be shown from Eq. (2-1) that the spectrum of the first segment
is dominated by the exponential component, $e^{\alpha t}$, of which the "negative bandwidth" equals
$\alpha/\pi$. Likewise, the frequency response of the second segment can be approximated by a
first-order lowpass filter with a cutoff frequency $F_a = 1/(2\pi t_a)$ (Fant and Lin, 1988). As a result,
the bandwidths of the first and second segments, $B_1$ and $B_2$, are


It  can  be  shown  that  the  poles  of  the  filter  are  either  both  real  or  a complex  conjugate 
pair. The center frequency $\omega$ and bandwidth $B$ can be calculated from the zeros by


We have found that the center frequency $\omega$ of the poles and $\omega_g$ of the LF model are nearly
the same. Thus, we are only concerned with the change in bandwidths of the poles of the
inverse filter. The bandwidth of the source spectrum $B$ is, in general, very close to $B_1$, causing
the waveshape of the first segment to be obliterated after inverse filtering. However, $B_2$ is
much higher than $B$. The second segment thereby retains its waveshape after inverse filtering,
although the resultant phase may be different from the original. A typical example is given
in Figure 2-4, which displays the spectra of the first and second segments of the LF model as
well as the corresponding spectrum of the two-pole model. As a result, the residue derived
from the LF-model waveform has a flat spectrum envelope and exhibits a sharp pulse at the

Figure 2-4.  Effects of the inverse filter imposed on the differentiated glottal waveform:
(a) LF-model waveform, (b) FFT spectra of LF-model waveform and two-pole
model, H(z), (c) FFT spectra of individual segments of the LF-model.

residue signals should enable us to retrieve one signal from the other. Although the residue
signal does not appear to be highly informative, its integral, in contrast, tends to partially
re-exhibit the shape of the first segment of the differentiated glottal flow. Thus, the analysis
strategies based on the differentiated glottal flow can be transplanted to the integrated
residue with little modification. To support our claim, we perform the inverse filtering based
on a synthetic vowel, as shown in Figure 2-5, so that the similarities and contrasts between
the LF-model waveform and the integrated residue can be noticed readily.
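
A minimal Python sketch of obtaining the residue and its integral from a speech segment is given below. It assumes the inverse-filter coefficients `a` have already been estimated by some LP analysis; the function name and the optional `leak` factor (a leaky integrator is a common practical safeguard against drift) are assumptions of this sketch rather than part of the procedure described here. With `leak=1.0` the integral is a plain running sum.

```python
import numpy as np

def integrated_residue(speech, a, leak=1.0):
    """Inverse-filter `speech` with A(z) and integrate the residue sample by sample."""
    residue = np.convolve(speech, a)[: len(speech)]
    integral = np.empty(len(residue), dtype=float)
    acc = 0.0
    for i, e in enumerate(residue):
        acc = leak * acc + e          # running (optionally leaky) sum
        integral[i] = acc
    return residue, integral
```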


2.4  Choice  of  Model  Type 

In  essence,  building  a speech  model  is  equivalent  to  systematically  coordinating  the 
acoustic  and  perceptual  attributes  into  a joint  construct.  For  speech  synthesis,  modeling  is 
usually aimed at the parameterization of the voice source and the vocal tract.

In Chapter 1, we have shown that the speech production and propagation
mechanisms can be described by a source-filter model (Fant, 1960), consisting of the glottal

effective  in  characterizing  speech  signals.  The  formant  and  LP  synthesizers  belong  to  this 
group,  and  both  synthesizers  were  used  to  study  vocal  variability  (Rosenberg,  1971;  Holmes, 
1973; Sambur et al., 1978; Atal and David, 1979; Kuwabara, 1984; Hermansky et al., 1985;
Hedelin, 1986; Klatt, 1987; Muta et al., 1987; Childers et al., 1989b; Childers and Wu, 1990;
Klatt and Klatt, 1990; Childers and Lee, 1991; Lalwani and Childers, 1991).

For  the  formant  synthesizers,  the  properties  of  each  component  of  the  source-filter 
model  are  elaborated  individually.  Usually,  the  lip  radiation  is  approximated  by  a 
differentiator.  The  vocal  tract  transfer  function  is  characterized  by  the  formants/ 
anti-formants, which are implemented by resonator/anti-resonator filters. The source model

Figure 2-5.  Illustration of the similarity between the differential glottal flow and the
integrated residue signal. Waveforms from top to bottom are: (1) LF model,
(2) synthesized vowel /i/ produced by the LF model, (3) residue signal, and (4)
integral of the residue signal.


such  synthesizers  was  judged  to  be  satisfactorily  high,  provided  that  the  glottal  flow  is 
appropriately modeled (Klatt, 1980; Holmes, 1983; Pinto et al., 1989).

For  the  LP  synthesizers,  the  composite  frequency  response  of  the  glottal  flow,  vocal 
tract and lip radiation is modeled by a slowly time-varying filter (Atal and Hanauer, 1971).
The associated source excitation is primarily used to account for the periodicity of the glottal
pulse, in other words, the pitch. The synthetic speech quality of many previous models is
considered unnatural due to an oversimplified excitation, failure to properly identify
voicing, and poor spectral resolution (Wong, 1980; Kahn and Garst, 1983). However, with

quality. 

On  the  whole,  it  appears  that  the  perceptual  quality  of  synthetic  speech  is  improved 
by  improving  source  excitation  models  for  both  LP  and  formant  synthesizers.  In  the  past 
the  lack  of  a physiological  interpretation  for  the  residue  was  considered  the  primary  obstacle 
against  the  use  of  LP  techniques  in  examining  a specific  vocal  quality,  making  the  formant 
synthesizer  a more  popular  tool.  Nonetheless,  this  argument  may  no  longer  be  valid  once 
we  are  able  to  verify  the  relationship  between  the  residue  and  the  glottal  flow.  Since  the  LP 
synthesizer has the advantages of (1) computational efficiency, and (2) ease of obtaining the
residue  from  speech,  we  decided  to  use  the  LP  synthesis  model  as  the  means  to  accomplish 
this  research. 

We  start  by  integrating  the  acoustic  attributes  into  a comprehensive  model.  Our 
strategies in constructing a high-quality LP speech production model follow the
analysis-by-synthesis  rules.  This  speech  production  model  is  depicted  in  Figure  2-6,  while 
Figure  2-7  presents  the  correlations  we  are  going  to  examine  between  the  acoustic 
parameters  and  model  parameters.  Before  we  work  out  the  details,  there  is  much 
groundwork to establish.

Figure 2-7.  Correlations between acoustic attributes and model parameters.


In  addition  to  the  description  of  the  experimental  data  base,  this  section  provides 
background information about vocal quality including modal, vocal fry, and breathy voices.
The  source  properties  quoted  from  previous  research  are  listed  for  the  purpose  of  comparing 
our  analysis  results.  After  explaining  the  measuring  methods,  we  discuss  the  pre-processing 
schemes  required  to  extract  the  acoustic  features.  These  preparations  constitute  the 
foundation  for  the  source  extraction. 


2.5.1  Experimental Data Base


high-speed  laryngeal  photography.  Nine  utterances  for  three  different  voice  types,  i.e., 
modal,  vocal  fry,  and  breathy,  served  as  our  data  base.  All  these  utterances  were  categorized 
by  professional  speech  scientists.  A description  of  the  data  base  is  shown  in  Table  2-1. 


Table 2-1.  Data base for speech analysis.

Subject   Sex   Voice type   Number of pitch periods
M1        M     modal        382
M2        M     modal        331
M3        M     modal        244
V1        M     vocal fry    176
V2        M     vocal fry    239
V3        M     vocal fry    109
B1        M     breathy      201
B2        M     breathy      273
B3        M     breathy      454


During  speech  processing,  the  measures  were  performed  over  a steady-state  interval. 
As  will  be  discussed  later,  the  acoustic  measures  required  a precise  identification  of  the  pitch 
period. An additional signal, the electroglottograph (EGG), was employed in this study to
aid the speech processing. We sampled the speech and EGG signals at 10 kHz with 16-bit
precision. Both signals were digitized simultaneously using a Digital Sound Corp. DSC-240
preamplifier and a DSC-200 digitizer. The microphone was an Electro-Voice RE-10 held
six inches from the lips. Before digitization, the signals were bandlimited to 5 kHz by
anti-aliasing, passive, elliptic filters with a minimum stopband attenuation of 55 dB and a
passband ripple of ±0.2 dB. All data recordings were collected in an Industrial Acoustics
Company (IAC) single wall sound booth. To compensate for the microphone characteristics
at low frequencies, the frequency response of the speech recordings was further corrected
using a linear phase FIR filter.


The  adequacy  of  an  acoustic  measure  can  be  illustrated  by  its  capability  for 
characterizing  vocal  quality.  When  assessing  the  acoustic  measures,  we  certainly  need  to 
have  a general  concept  of  the  vocal  quality.  As  mentioned  in  the  previous  chapter,  the  vocal 
quality  is  referred  to  as  the  auditory  impression  the  listener  experiences  upon  hearing  the 
speech of another talker. Major types of vocal quality, according to Laver and Hanson
(1981), are modal, breathy, vocal fry, falsetto, harshness, and whisper. We excluded
falsetto, harshness and whisper from this study because the other three voice types were
considered sufficiently representative of three modes of vocal fold vibratory patterns. The
qualitative definitions (Lieberman and Blumstein, 1988; Eskenazi et al., 1990) of the three
voice types are:

Modal  : Defined  as  a normal  phonation.  A modal  phonation  is 
characterized  by  a moderate  frequency,  wide  lateral  excursions,  and 
complete  closure  of  the  glottis  during  about  one  third  of  the  entire 
pitch  period. 

Breathy  : Defined as audible escapage of air through the glottis due to
inversely proportional to the length of the closed glottal phase.


Vocal fry  : Defined as a low-pitched, creaky kind of phonation. It also
shows a great deal of irregularity from one pitch period to the next.

In  this  study,  we  are  interested  more  in  the  source  characteristics  than  in  the  vibratory 
frequency  of  the  laryngeal  vibration.  This  is  because  the  effect  of  the  glottal  vibration  in 

Some  acoustic  features  of  glottal  factors  of  various  voice  types  are  summarized  in  Table  2-2. 
These  features  will  serve  as  references  when  we  examine  speech  features  using  proposed 


Source:  Ahn,  1991;  Childers  and  Lee,  1991. 


2.5.3  Analytical  Logics 


In  most  research  the  characteristics  of  glottal  flow  for  one  pitch  period  are  delineated 
using  variables  consisting  of  either  the  relative  timing  or  the  durations  of  special  events  such 
as  the  glottal  opening  and  closure.  Because  the  pitch  period  is  usually  a known  value,  it  is 


the  waveshape,  rather  than  the  absolute  timing,  that  has  attracted  the  researchers'  attention. 
This  suggests  a standardization  procedure  for  those  variables  based  on  the  underlying  pitch 
period.  Properties  of  the  standardized  variables  drawn  from  a large  population  are  assumed 
to  represent  general  characteristics  of  the  glottal  source.  Many  postulates  and  conclusions 
pertaining  to  vocal  quality  are  thereby  deduced  based  on  the  statistical  results.  Alternatively, 
such  a statistical  analysis  can  be  performed  by  evaluating  the  averaged  glottal  pulse  over  a 
large  number  of  sample  periods.  Such  a logical  variant  will  facilitate  the  inquiry  of  some 
timing  events  in  the  glottal  flow,  the  differentiated  glottal  flow  and  the  integrated  residue. 


To  perform  the  alternative  statistical  analysis  suggested  above,  we  resample  every 
pitch period at a variable rate so that every digitized waveform has the same length. In other
words, the sampling rate for each individual pitch period should be different in order to make
the digitized waveforms summable. Difficulties associated with this procedure lie in the
identification and standardization of each pitch period. A direct and exact solution, from a
mathematical point of view, is the sinc-interpolated sampling rate conversion (Schafer and
Rabiner, 1973; Kroon and Atal, 1990; Schumacher and Chafe, 1990). The
sinc interpolation is given by

\[ \hat{s}(nT) = \sum_{i} s(i)\, \frac{\sin[\pi (f_s nT - i + \theta)]}{\pi (f_s nT - i + \theta)} \tag{2-6} \]

where $f_s$ is the sampling frequency of the original sequence $s(i)$, $T$ is the new sampling interval
of $\hat{s}(n)$, and $\theta$ is a phase offset. The $T$ and $\theta$ for each individual period are determined so
as to yield a maximum similarity across the resampled periods.
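
The interpolation sum can be written directly in Python as a matrix product, as sketched below. The function name and argument layout are assumptions, and the search for the $T$ and $\theta$ that maximize similarity across periods is not shown. Note that `np.sinc(x)` computes $\sin(\pi x)/(\pi x)$, so the argument is expressed in units of original samples.

```python
import numpy as np

def sinc_resample(s, fs, T, n_out, theta=0.0):
    """Evaluate the sinc-interpolation sum at n_out new instants spaced T apart."""
    s = np.asarray(s, dtype=float)
    i = np.arange(len(s))                 # indices of the original samples
    n = np.arange(n_out)                  # indices of the new samples
    # Argument (fs*n*T - i + theta) for every (new sample, original sample) pair
    arg = fs * n[:, None] * T - i[None, :] + theta
    return np.sinc(arg) @ s
```

The cost grows with the product of the input and output lengths, which illustrates why a cheaper scheme is attractive for long recordings.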

The processing required by sinc interpolation is computationally expensive. A
simpler approach is presented below to facilitate the computation. First of all, we interpolate
the analyzed signal $s(n)$ by a factor of five by using a lowpass filter:

\[ \hat{s}(n) = s_I(n) * h(n) \tag{2-7} \]

where $*$ denotes convolution, $s_I(n)$ is the linearly interpolated data sequence, and $h(n)$
is the impulse response of a lowpass filter with the cutoff frequency at $\pi/5$. In our case, a
511-order FIR filter designed by using the window method is employed to avoid phase
distortion. The impulse response of this FIR filter is


hM  = £h(n)  H(n) 


(2-8)

The next step is to separate each individual pitch period along the signal. We use the
two-channel  approach  (Krishnamurthy  and  Childers,  1986)  to  assist  this  processing 
automatically.  The  glottal  closure  instant,  which  is  signaled  by  a rapid  decrease  in  the  EGG, 
has  been  found  to  coincide  with  the  minimum  in  the  differentiated  EGG  (DEGG)  for  that 
period.  Thus,  we  can  locate  the  instant  of  glottal  closure  by  picking  the  negative  peaks  of 
the  DEGG  signal,  as  illustrated  in  Figure  2-8.  A pitch  period  is  then  defined  as  the  interval 

Due  to  the  propagation  delay  of  the  sound  wave  from  the  glottis  to  the  microphone, 
we  apply  a time  lag  of  0.9  msec  to  the  EGG  signal  to  achieve  synchronization  with  the  speech 
signal. Also, in order to improve the accuracy of the locations of the peaks, we employ a
quadratic interpolation method (Markel and Gray, 1976; Titze et al., 1987): Let $f(-1)$, $f(0)$,
and $f(1)$ define three points centered at a peak, where $f(0)$ corresponds to a discrete minimum
value, and $f(-1)$ and $f(1)$ are points to the left and right of $f(0)$. The position of the interpolated
minimum, $k$, is then resolved from a second-order approximation among the three points by

\[ k = -\frac{f'(0)}{f''(0)} = -\frac{f(1) - f(-1)}{2\,\bigl(f(1) - 2f(0) + f(-1)\bigr)}. \tag{2-12} \]


Values  obtained  through  this  process  are  rounded  to  the  nearest  sample  of  the  re-sampled 
signal. Thus, the resulting resolution of each pitch period, estimated from the re-sampled
sequence,  increases  by  approximately  five  times. 
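
The peak picking and quadratic refinement can be sketched in Python as follows. The first difference is used here as a stand-in for the differentiated EGG, and the amplitude `threshold` for accepting a negative peak is a hypothetical parameter introduced for this sketch; neither the time-lag compensation nor the rounding to the re-sampled grid is shown.

```python
import numpy as np

def refine_minimum(f, k):
    """Parabolic refinement of a discrete minimum at index k (1 <= k <= len(f) - 2)."""
    num = f[k + 1] - f[k - 1]
    den = 2.0 * (f[k + 1] - 2.0 * f[k] + f[k - 1])
    return k - num / den if den != 0 else float(k)

def closure_instants(egg, threshold):
    """Locate negative peaks of the differentiated EGG below `threshold`."""
    degg = np.diff(egg)
    picks = [k for k in range(1, len(degg) - 1)
             if degg[k] < threshold and degg[k] <= degg[k - 1] and degg[k] < degg[k + 1]]
    return [refine_minimum(degg, k) for k in picks]
```

Successive differences of the returned instants then give fractional-sample estimates of the pitch periods.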

Finally,  the  length  of  the  pitch  period  is  adjusted  to  512  samples  by  using  the  FFT 
method.  That  is,  depending  on  the  number  of  the  samples  in  the  interpolated  period,  we 
append zeros or remove the high frequency range of the FFT sequence to achieve the
intended length (512 samples in this case). The fixed-length signal is then obtained by taking
the IFFT of the resultant FFT sequence. Notice that the discontinuity (linear trend) between
the two boundaries of the underlying signal must be removed before applying the FFT method,
since this signal has to be circularly periodic.
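
A compact Python sketch of this FFT-based length adjustment is shown below. The function name, the use of the real-input FFT, and the amplitude rescaling by the length ratio are assumptions of this sketch; the output is returned with the linear trend removed, and whether the trend should be added back afterwards is left open here.

```python
import numpy as np

def normalize_period(x, n_out=512):
    """Resample one pitch period to n_out samples by padding or truncating its spectrum."""
    x = np.asarray(x, dtype=float)
    n_in = len(x)
    # Remove the straight line joining the two end points (linear trend)
    x = x - np.linspace(x[0], x[-1], n_in)
    X = np.fft.rfft(x)
    Y = np.zeros(n_out // 2 + 1, dtype=complex)
    m = min(len(X), len(Y))
    Y[:m] = X[:m]                          # zero-pad or drop the high-frequency bins
    # Rescale so that waveform amplitudes are preserved after the length change
    return np.fft.irfft(Y, n=n_out) * (n_out / n_in)
```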


2.6  Feature Extraction


Under the guidance of the proposed speech production model (Figure 2-6), we
explore the relation between the model parameters and some existing acoustic measures. A
pitch-synchronous covariance LP analysis is adopted to estimate the spectral properties of
the speech signal. The LP order is chosen to be 14 to account for the spectral tilt of the glottal
flow and the number of formants within the 5 kHz bandwidth. Following the arrangement
presented in the survey of acoustic measures, we illustrate how to extract model parameters
that correspond to the perturbation measures, spectral tilt, phase characteristics, and vocal
noise sequentially. The examination regarding the roots of the inverse filter is not considered
in this study.


2.6.1  Perturbation  Measure 


As  mentioned  previously,  the  perturbation  measures  are  used  to  characterize  the 
vibratory  patterns  of  the  vocal  folds,  which  include  variabilities  in  the  pitch  period  and 
waveform  amplitude.  The  perturbation  measures  can  be  further  divided  into  two  types, 
namely,  subharmonics  and  random  noise  as  demonstrated  in  Figure  2-9.  The  subharmonics 
result  from  a repetitive  vibratory  pattern  extending  more  than  one  pitch  period,  while  the 
random  noise  represents  the  unpredictable  characteristics  of  the  vocal  fold  vibration.  To 
avoid  further  complicating  the  problem,  we  confined  our  research  to  random  noise  while 
illustrating  the  proposed  perturbation  measures. 

Because each subject involved in this experiment was instructed to utter a steady
vowel with a comfortable intensity, the pitch and intensity contours of recorded speech were
considered to be fairly stable. Typical pitch and intensity contours of a modal voice are
shown in Figure 2-10(a) and Figure 2-11(a). As we are interested in the perturbation
associated with the measured signal, a proper initial step is to obtain the corresponding
deviation by removing the average value. By inspecting the spectral properties of the
deviations of pitch and intensity signals (Figures 2-10(b) and 2-11(b)), we find that both
spectra are relatively flat except at the region of low frequencies. This finding leads us to
conjecture that the deviation signal can be modeled as a slowly fluctuating component
accompanied by a white noise source. The low-frequency component in the deviation
signal,  termed  “drift”  in  many  studies,  is  known  as  the  inherent  nature  of  human  speech. 
Though  the  dynamic  patterns  of  drift  determine  the  tune  in  speech,  they  are  unlikely  to 
provide  much  information  about  vocal  quality.  Thus,  we  are  safe  in  discarding  this  effect 
while  studying  vocal  quality.  In  fact,  as  pictured  from  the  point  of  view  of  the  filtering 
process,  most  perturbation  measures  were  introduced  to  eliminate  the  drift  or  to  emphasize 
the  white  noise  source.  Examples  for  some  perturbation  measures  and  their  mathematical 
relationships  can  be  seen  in  Pinto  and  Titze  (1990). 


Speech (V3; sample range = [16401:19400])


The  use  of  a highpass  filter  will  not  properly  separate  the  noise  source  from  the  drift 
because  it  removes  the  low-frequency  portion  of  the  noise  also.  Thus,  we  performed  the 
separation  in  the  frequency  domain  using  a DFT  method  with  the  following  steps: 

1.  We  remove  the  linear  trend  of  the  two  end  boundaries  of  the  deviation 
DFT  method. 

2.  Before computing the DFT sequence of the resultant deviation signal, we
further remove the d.c. component introduced by the first step.

3.  Except for the d.c. component, the magnitude of the DFT sequence below
one-fifth sampling rate ($\pi/5$) is set to the average magnitude of the rest of the
DFT sequence (see Figures 2-10(b) and 2-11(b)).

4.  Given the new DFT sequence with the phase unchanged, we then take the
inverse DFT of the new sequence to yield the noise signals (see Figures
2-10(c) and 2-11(c)).
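
The four steps above can be sketched in Python as follows. The function name is hypothetical, and the cutoff is interpreted here as the normalized angular frequency $\pi/5$ (passed as `cutoff=0.2`, a fraction of $\pi$); since the text gives the cutoff both as a fraction of the sampling rate and as $\pi/5$, this interpretation is an assumption and the parameter is left adjustable.

```python
import numpy as np

def remove_drift(dev, cutoff=0.2):
    """Flatten the low-frequency DFT magnitudes of a deviation signal, keeping the phase."""
    d = np.asarray(dev, dtype=float)
    n = len(d)
    d = d - np.linspace(d[0], d[-1], n)        # step 1: remove end-to-end linear trend
    d = d - d.mean()                           # step 2: remove the d.c. component
    D = np.fft.rfft(d)
    w = np.arange(len(D)) * 2.0 / n            # bin frequency as a fraction of pi
    low = (w > 0) & (w < cutoff)
    rest = w >= cutoff
    mag, phase = np.abs(D), np.angle(D)
    mag[low] = mag[rest].mean()                # step 3: replace low-band magnitudes
    return np.fft.irfft(mag * np.exp(1j * phase), n=n)   # step 4: inverse DFT
```

Unlike a highpass filter, this keeps a low-frequency contribution at the same average level as the rest of the spectrum, so the noise estimate remains approximately white.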

Examples of the histograms of the noise signals are shown in Figure 2-12. It appears
that the zero-mean Gaussian distribution provides a good fit for the underlying frequency
distribution. We therefore assume that the noise component exhibits a Gaussian distribution,
in which the standard deviation is sufficient to characterize the statistical property. This
hypothesis can be informally validated by inspecting the cumulative distribution
functions of the noise components and by comparing them with the corresponding Gaussian
distribution function with the same mean and variance (Figure 2-13).

Following  the  terminologies  defined  by  Pinto  and  Titze  (1990),  we  use  6p  and  <5;  to 
denote  the  standard  deviationsof  the  pitch  and  intensity  noise  components  respectively.  The 
following  discussion  illustrates  how  the  dp  and  4/  are  related  to  the  jitter  and  shimmer.  We 
define  the  normalized  jitter  (in  percent)  as 


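$$\%\mathrm{jitter} = \frac{\frac{1}{n}\sum_{i=1}^{n} |P_i - P_0|}{P_0} \times 100\%, \tag{2-13}$$

where $|\cdot|$ denotes the absolute value, $P_i$ is the $i$th pitch period in a segment of $n$ pitch periods,
and $P_0$ is the mean value of the $P_i$'s. If we define $P_i^d = P_i - P_0$, then Eq. (2-13) can be
rewritten as

$$\%\mathrm{jitter} = \frac{1}{n P_0}\sum_{i=1}^{n} |P_i^d| \times 100\%. \tag{2-14}$$

By assuming $P_i^d$ to be a zero-mean random process, the equation can be approximated by

$$\%\mathrm{jitter} \approx \frac{\overline{|P^d|}}{P_0} \times 100\% \approx \delta_p \times 100\%, \tag{2-15}$$

where the overbar denotes the statistical average. Similarly, the normalized shimmer (in
percent) is defined as

$$\%\mathrm{shimmer} = \frac{1}{n A_0}\sum_{i=1}^{n} |A_i - A_0| \times 100\% \approx \delta_I \times 100\%, \tag{2-16}$$

Figure 2-11. Extraction of intensity noise: (a) intensity contour; (b) and (c) are the
same as in Figure 2-10.

Figure 2-12. Histograms of (a) pitch noise and (b) intensity noise.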

where $A_i$ defines the square root of the intensity (rms power) of the $i$th glottal period, and
$A_0$ is the average of the $A_i$'s. Compared to the definition given by other researchers, where $A_i$
is the peak magnitude, the adopted form is more likely to correspond with the perceptual
characteristics of the human auditory apparatus that resolve short-time spectra of acoustic
signals. More important, it is the power density rather than the peak magnitude that is used for
the speech analysis and synthesis in the proposed speech production model.


From Eqs. (2-15) and (2-16), we know that the %jitter and %shimmer are just other
manifestations of the pitch and intensity noise. To verify the foregoing derivation,
we list the $\delta_p$'s and $\delta_I$'s measured from the pitch and intensity noise signals as well as those
derived from %jitter and %shimmer in Table 2-3. The values obtained by the two different
approaches are very close; such results thereby substantiate our
assumptions with regard to the perturbation noise and its relation to other perturbation
measures.
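As a small worked illustration of the perturbation measures discussed above, the sketch below computes a mean-absolute-deviation percentage for a sequence of pitch periods and for a sequence of per-period rms amplitudes. The function names and the sample values are hypothetical and serve only to show the arithmetic.

```python
import math


def percent_perturbation(values):
    """Mean absolute deviation from the mean, normalized by the mean, in percent."""
    n = len(values)
    mean = sum(values) / n
    return 100.0 * sum(abs(v - mean) for v in values) / (n * mean)


def rms_amplitudes(periods):
    """Square root of the mean power of each glottal period (one list per period)."""
    return [math.sqrt(sum(s * s for s in p) / len(p)) for p in periods]


# Hypothetical pitch periods (in samples) and hypothetical waveform periods.
pitch_periods = [100, 102, 98, 100]
jitter = percent_perturbation(pitch_periods)
shimmer = percent_perturbation(rms_amplitudes([[3, -3], [1, -1], [2, -2], [2, -2]]))
```

Using the rms amplitude rather than the peak magnitude follows the definition of $A_i$ adopted in the text.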


2.6.2 Spectral Tilt


Theoretically,  the  transfer  function  of  the  vocal  tract  is  characterized  by  a set  of 
formants,  which  are  distributed  along  the  frequency  axis.  The  spectral  tilt  of  speech  is  mostly 
dominated by the lip radiation and glottal shaping filter (Rabiner and Schafer, 1978). If the
lip  radiation  is  modeled  as  a differentiator,  then  the  differentiated  glottal  flow  becomes  the 
most  pertinent  component  to  determine  the  spectral  tilt  of  a speech  signal.  This  implies  that 
the  spectral  tilt  of  the  differentiated  glottal  pulse  can  also  be  estimated  from  the  speech  signal. 

As discussed in Section 2.3, we used a two-pole filter to approximate the spectral tilt
of  differentiated  glottal  flow.  The  filter  coefficients  are  now  estimated  using  LP  analysis 
based  on  the  speech  signal.  Table  2-5  lists  the  estimated  LP  coefficients  for  the  three 
different  voice  types. 

2.6.3 Glottal Phase Characteristics

In addition to a general comparison of the glottal phase characteristics for the nine
utterances, we present a novel measure called the "abruptness index" to depict the return phase
of the glottal flow.

2.6.3.1 General properties

Because  the  magnitude  spectrum  of  the  residue  signal  is  flat  (not  in  a strict  sense  if 
we  consider  the  spectral  harmonics  and  modeling  errors)  due  to  the  inverse  filtering,  phase 
characteristics  are  the  only  information  left  in  the  residue  that  is  related  to  the  glottal  source 
(Wong  and  Markel,  1978;  Hedelin,  1988).  As  mentioned  earlier,  it  is  the  integrated  residue 
which  resembles  the  differentiated  glottal  flow,  exhibiting  certain  physiological  features  of 
the  vocal  folds.  Presumably,  we  can  explore  the  phase  properties  of  the  glottal  source  by 
examining the integrated residue. Unfortunately, the timing of transitional glottal events is
distorted by the inverse filtering and integration. Timing factors extracted from the
integrated residue are not as useful as those obtained for the glottal flow and, therefore, will
not be investigated. Instead, the comparison across the integrated residues of the nine utterances
is performed using correlation coefficients. The results are given in Table 2-4. These
results, however, do not adequately characterize the correlation between the glottal phases
and vocal quality. Consequently, we introduce another measure called the "abruptness index."


Table 2-4. Correlation coefficients of the integrated residues for all utterances.

          B3       B2       B1       V3       V2       V1       M3       M2       M1
M1     …7995   0.1530   0.7457  -0.5207   0.9687   0.5504   0.8607   0.9457   1.0000
M2     …7224   0.2232   0.8203  -0.4217   0.9731   0.5609   0.9056   1.0000
M3     …6231   0.3244   0.8327  -0.1929   0.8891   0.2379   1.0000
V1     …4206  -0.1995   0.4858  -0.3752   0.5080   1.0000
V2     …7676   0.1488   0.7383  -0.4627   1.0000
V3     …4187  -0.3896  -0.2588   1.0000
B1     …4877   0.3877   1.0000
B2     …0001   1.0000
B3    1.0000


The idea for this measure stems from the LF-model. In a study of the acoustic
variability of the glottal source factors, Ahn (1991) concluded that $t_a$ and $t_c$ of the LF-model
were the two most significant parameters correlating to vocal quality. Because $t_a$ and $t_c$ are
the parameters controlling the rapidity of the return phase in the LF-model, they are accepted
as indicators of vocal abruptness. In Ahn's study, $t_c$ was defined as the instant at which
the amplitude of the modeled differentiated glottal flow dropped to 1% of its peak value.


Accordingly, $t_c$ is derived by solving the following equation:

(2-17)

where $P_p$ is the pitch period and $\epsilon$ can be obtained a priori by solving

(2-18)

Since $t_a$ and $t_c$ form a mathematical mapping, we may say that the statistical significance of
these two parameters stands on the same footing. Based on this understanding, we need to
focus on only one parameter. It can be shown from Eq. (2-1) that


(2-19) 


The equations above explicitly tell us that $t_a$ can be readily obtained if we know the derivative
of $E(t)$ at $t_e$. Since the instant $t_e$ usually coincides with the largest value of $dE(t)$ and $-E_e$ is
usually the minimum of $E(t)$, it will be convenient for us to calculate $t_a$ using the following
equation:

$$t_a = \frac{-\min(E(t))}{\max(dE(t))}\,\delta t, \tag{2-20}$$

where $\delta t$ denotes an infinitesimal time interval. For a discrete signal, this value, $\delta t$, can be
substituted by the sampling interval, $\Delta T$, provided that the interval is sufficiently small. If
we define the vocal abruptness index, $I_a$, as the normalized $t_a$ in percentage, then $I_a$ becomes

$$I_a = \frac{-\min(E(n\Delta T))\,\Delta T}{\max(\mathrm{diff}(E(n\Delta T)))\,P_p} \times 100\%, \tag{2-21}$$

where $\Delta T$ is the sampling time, $P_p$ is the pitch period, and diff stands for the difference
function. It is obvious that $I_a$ is readily obtained once $E(n\Delta T)$ is available.
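The computation of the abruptness index from a sampled waveform can be sketched as follows. The function name and the toy waveform are hypothetical; the sketch simply takes the negated minimum of the sampled signal, the largest first difference, the sampling interval, and the pitch period, as described above.

```python
def abruptness_index(e, dt, pitch_period):
    """Abruptness index in percent for one period of a sampled waveform e.

    e            -- samples of the differentiated glottal flow (or a substitute)
    dt           -- sampling interval in seconds
    pitch_period -- pitch period in seconds
    """
    diffs = [b - a for a, b in zip(e, e[1:])]   # first difference of the samples
    t_a = -min(e) * dt / max(diffs)             # return-phase time constant
    return 100.0 * t_a / pitch_period


# Hypothetical period: negative peak of -4 followed by a rise of 3 per sample.
example = [0.0, -1.0, -2.0, -4.0, -1.0, 0.0]
index = abruptness_index(example, dt=1e-4, pitch_period=8e-3)
```

A faster return to the baseline (a larger maximum difference) gives a smaller index, so the measure shrinks as the closure becomes more abrupt.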


The acquisition of $E(n\Delta T)$ calls for the glottal inverse filtering
technique, which is not always feasible. Thus, we demonstrate how to convert the residue
signal into the differentiated glottal flow. As discussed in Section 2.3, a two-pole filter was
employed to model the spectral slope of the differentiated glottal flow $U_g'(n\Delta T)$. The
Z-transform of $U_g'(n\Delta T)$ is given as

$$U_g'(z) = \frac{e(z)}{1 + a_1 z^{-1} + a_2 z^{-2}}, \tag{2-22}$$

where $e(z)$ denotes the Z-transform of the residue. The differentiated glottal flow, $U_g'(n\Delta T)$,
can be approximated by passing the residue signal through the two-pole filter, whose
coefficients are derived by LP analysis of the speech signal. Substituting $U_g'(n\Delta T)$ for $E(n\Delta T)$
in Eq. (2-21), we can easily obtain the abruptness index (listed in Table 2-5 on page 61).
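Passing a residue through a two-pole filter amounts to a second-order recursion. The sketch below assumes the sign convention in which the denominator is $1 + a_1 z^{-1} + a_2 z^{-2}$; the function name and the test coefficients are hypothetical, and the coefficients are not values estimated in this study.

```python
import math


def two_pole_filter(residue, a1, a2):
    """Apply 1 / (1 + a1 z^-1 + a2 z^-2): y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    out = []
    y1 = y2 = 0.0                      # previous two outputs, zero initial state
    for x in residue:
        y = x - a1 * y1 - a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Hypothetical complex pole pair of radius r and angle theta.
r, theta = 0.9, 0.3
a1, a2 = -2 * r * math.cos(theta), r * r
response = two_pole_filter([1.0, 0.0, 0.0, 0.0], a1, a2)
```

The test coefficients use the usual relation between a complex-conjugate pole pair and the denominator coefficients, $a_1 = -2r\cos\theta$ and $a_2 = r^2$, so the impulse response can be checked against its closed form $r^n \sin((n+1)\theta)/\sin\theta$.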

2.6.4  Vocal  Noise 

Much research has been directed to estimating vocal noise pertaining to a steady
utterance or running speech. However, due to the limited capability of existing measures,
the noise could only be presented in the form of signal-to-noise or harmonic-to-noise ratios
that do not offer enough details of vocal noise. To gain a better understanding of vocal noise,
we examine three of its properties: the signal-to-noise
ratio (SNR), amplitude modulation, and noise spectra.

2.6.4.1 Noise extraction

In order to acquire the vocal noise, techniques for identifying and separating the
prototype patterns (i.e., the standardized signal of one pitch period along an utterance) are
required. The identification of pitch periods was accomplished by peak picking the DEGG
signal. The separation of noise from the prototype can be achieved either in the frequency
domain or in the time domain. Two approaches influenced us most in striving
to accomplish the noise extraction. One of the approaches was proposed by Yumoto et al.
(1982), who considered the noise as the deviation of a quasi-periodic speech signal. They
first derived a prototype period of phonation by averaging the waveform of every period in
a steady utterance. This prototype was then subtracted from the speech signal for each pitch
period to yield the noise. The use of such an approach, however, requires that the subject's
utterances be strictly steady for a number of periods. This may not be feasible in some
cases. Thus, Kasuya et al. (1986a, 1986b) proposed another measure by estimating the
harmonic-to-noise ratio. The noise signal was isolated from periodic components either
by using a comb filter or by collecting the non-harmonics in the spectrum of the analyzed
signal. In such a method, any component not harmonically related to the fundamental
frequency was classified as noise.

Although Kasuya's method was robust and computationally efficient, it was
inappropriate from a practical point of view since it took the jitter and shimmer into account
as part of the noise. In our proposed speech production model, the vocal noise is considered
to be an independent module. This design concept requires that the uncorrelated factors, such
as jitter and shimmer, have to be segregated from the real noise source. Thus, we adopt Yumoto's
idea in a modified form. First, we standardize the power and length of every pitch period of the
integrated residue using the method discussed in Section 2.5.4 before acquiring the
prototype. Then we estimate the vocal noise by minimizing the least square difference,
$E(S_i, S_p)$, between the prototype signal $S_p$ and the analyzing signal $S_i$:

$$E(S_i, S_p) = \sum_{k=1}^{N} \left[ C^l(S_i(k)) - \gamma S_p(k) \right]^2, \tag{2-23}$$

where $C^l$ denotes the circular shift with a lag of $l$, and $N$ is the length of the standardized
pitch period. The purpose of using $C^l$ is to rectify the phase discontinuity between $S_i$ and $S_p$.
The scale factor $\gamma$ is then determined by setting

$$\partial E(S_i, S_p) / \partial \gamma = 0, \tag{2-24}$$

which leads to

$$\gamma = \frac{\sum_{k=1}^{N} C^l(S_i(k))\, S_p(k)}{\sum_{k=1}^{N} S_p^2(k)}. \tag{2-25}$$


The lag $m$ that yields the maximum $\gamma$ is chosen to be the correct phase offset. Thus, the noise
signal becomes

$$n(k) = C^m(S_i(k)) - \gamma S_p(k). \tag{2-26}$$

A problem with this approach is determining an appropriate number
of periods to form an analysis window. The prototype derived from a short window is
statistically unreliable. On the other hand, it is unlikely that a steady phonation is maintained
throughout the utterance. Therefore, we have to examine the influence of the number of
periods with regard to the noise measure. … the analysis window is to search for the minimum
number of periods that gives a small standard
deviation. Here we use three different utterances as a pilot experiment. As shown in Figure
2-14, the standard deviations of the three samples are relatively large when the analysis window
is small. When the analysis window is increased to more than 15 periods, both the standard
deviations and mean values become stable. Thus, the prototype is calculated using a window
of 15 consecutive periods, in which the current period is located at the center of the window.
The selected number is somewhat smaller than that suggested by other researchers (Yumoto
et al., 1982; Titze et al., 1987; Eskenazi et al., 1990). We reason that this result is due to the …

Figure 2-14. Variation of SNR versus number of analyzed periods for modal (M1),
vocal fry (V1), and breathy (B1) voices.
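The alignment and subtraction of the prototype can be sketched as below. The function names are hypothetical, the circular shift is taken to mean that element $k$ of the shifted period is the original element $(k + l) \bmod N$, and the prototype is assumed to have been computed beforehand.

```python
def circular_shift(s, lag):
    """Circular shift: element k of the result is s[(k + lag) mod N]."""
    n = len(s)
    return [s[(k + lag) % n] for k in range(n)]


def extract_period_noise(period, prototype):
    """Return (noise, gamma, lag) for one standardized pitch period.

    For every lag, the least-squares scale factor gamma is computed; the lag
    with the largest gamma is kept and the scaled prototype is subtracted.
    """
    energy = sum(p * p for p in prototype)
    best = None
    for lag in range(len(period)):
        shifted = circular_shift(period, lag)
        gamma = sum(c * p for c, p in zip(shifted, prototype)) / energy
        if best is None or gamma > best[0]:
            best = (gamma, lag, shifted)
    gamma, lag, shifted = best
    noise = [c - gamma * p for c, p in zip(shifted, prototype)]
    return noise, gamma, lag
```

With the least-squares scale factor inserted, the residual energy equals the (shift-invariant) energy of the period minus $\gamma^2$ times the prototype energy, so choosing the lag with the largest positive $\gamma$ also minimizes the squared difference.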


Another problem causing concern is the fluctuation of low-frequency power.
Because the air escaping from the lungs is not continual, a frequent change of low-frequency
components is anticipated. Fortunately, owing to the fact that the noise in the low-frequency
region is perceptually masked by the harmonics of the fundamental frequency (Childers and
Lee, 1991), we can apply a notch filter to eliminate the low-frequency components without
disturbing the perceived quality. The cut-off frequency of this highpass zero-phase filter can
be designed to adapt to the current pitch period $P$, such that low-frequency components
below 500 Hz are sufficiently suppressed and high-frequency components are not affected.
The frequency response of the highpass filter is given by

$$H(z) = \frac{2-\alpha}{2} \cdot \frac{1 - z^{-1}}{1 - (1-\alpha) z^{-1}}, \tag{2-27}$$

where … 512, is the length of the pitch period after interpolation. The scalar factor, $(2-\alpha)/2$, in Eq.
(2-27) makes the magnitude unity at one-half the sampling frequency. Eventually,
the desired noise signal is the result of passing $n(k)$ through the notch filter.

2.6.4.2 Properties of vocal noise

Once we get the desired noise, the properties to be evaluated are:

• Signal-to-noise ratio

The signal-to-noise ratio (SNR) for the $i$th pitch period is calculated as

$$\mathrm{SNR}_i = 10\log_{10}(\cdots), \tag{2-28}$$

… listed in Table 2-5.


• Amplitude  modulation 

As shown in Figure 2-15, the amplitude modulation is obtained by averaging the
magnitude of the noise signal over all periods.

• Noise spectrum

The spectrum of the noise signal is computed using the FFT. Though the length of
every pitch period has been expanded to 512 samples, the frequency resolution of the FFT
sequence still depends on the actual fundamental frequency. Because the
fundamental frequency may change from period to period, the resolution differs from one FFT
sequence to another. Thus, we apply biharmonic interpolation
to the FFT sequences to achieve a common frequency resolution. The individual FFT spectra
are then averaged to yield an estimate of the noise spectrum. Figure 2-16 shows the period …
2-17. We recall that the noise is extracted from the residue signal, which is obtained using
techniques addressed in previous sections. The overall procedure is tedious, but is
straightforward and easy to implement.
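Two of the quantities used in this section lend themselves to a short numerical sketch: the response of a first-order highpass filter whose gain is normalized to unity at one-half the sampling frequency, and a per-period signal-to-noise ratio. Both functions are illustrative assumptions: the filter form, the value of `alpha`, and the choice of "signal" in the ratio are hypothetical and are not taken from the measured data.

```python
import cmath
import math


def highpass_response(omega, alpha):
    """Complex response of ((2-a)/2) (1 - z^-1) / (1 - (1-a) z^-1) at z = e^{jw}."""
    z_inv = cmath.exp(-1j * omega)
    return (2 - alpha) / 2 * (1 - z_inv) / (1 - (1 - alpha) * z_inv)


def snr_db(signal, noise):
    """10 log10 of signal energy over noise energy for one pitch period."""
    return 10 * math.log10(sum(s * s for s in signal) / sum(v * v for v in noise))


alpha = 0.1                                             # hypothetical filter parameter
gain_dc = abs(highpass_response(0.0, alpha))            # d.c. is blocked
gain_nyquist = abs(highpass_response(math.pi, alpha))   # unity at half the sampling rate
```

A smaller `alpha` moves the pole closer to the unit circle and narrows the suppressed low-frequency region, which is one way a cut-off could be adapted to the pitch period.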


Figure 2-15. Amplitude modulations of vocal noise for different voice types.

Figure 2-16. Spectra of vocal noise for different voice types.

Figure 2-17. Schematic flowchart for …

2.7 Discussion

The results we have gained so far can be summarized in four aspects:

(1) It was found that the perturbation noise can be modeled by a zero-mean Gaussian
… low-frequency drift. Measures that … the drift could
be used to characterize the source perturbations. In particular, we have used the %jitter and
%shimmer to indicate the standard deviations of the noise sources. The results of measured
perturbations with respect to three …
researchers' reports, i.e., vocal fry … in general. …
pitch subject. Interpreted from a psychoacoustic perspective, such values would have
different impacts on the perception of vocal quality (Wendahl, 1963). Furthermore,
loudness, a perceptual descriptor of the intensity, was reported to be a nonlinear function for
various frequencies (Robinson and Dadson, 1956). These factors confound the study of
voice quality merely on the basis of quantitative measures. To achieve a thorough
understanding of vocal quality, the research scope should cover speech perception as well
(Flanagan, 1972a; Bladon and Lindblom, 1981; Hermansky et al., 1985; Wang et al.,
1991).

(2) A comparison of the spectral tilt of the source can be performed by visually
inspecting the frequency responses of the two-pole filter model. As shown in Figure 2-18,
the spectral tilt is moderate, relatively flat, and steep for vocal fry, modal, and breathy voices,
respectively. …
coefficient $a_1$, since the coefficient $a_1$ and the poles $z_{1,2}$ of the modeled filter have the
following relation:

$$a_1 = \begin{cases} -2|z_1|\cos\theta \approx -2|z_1| & \text{if } \theta \neq 0, \pi \\ -|z_1 + z_2| & \text{if } \theta = 0, \pi \end{cases} \tag{2-29}$$

Hence, the value of $a_1$ can be used to indicate how close the poles are to the unit circle. A
larger $a_1$ corresponds to a flatter spectral tilt and broader bandwidth. According to the data
in Table 2-5, the values of $a_1$ for different voice types exhibited the following inequality:

$$|a_1|_{\text{Vocal fry}} > |a_1|_{\text{Modal}} > |a_1|_{\text{Breathy}}.$$

This result is congruent with the previous observation in Figure 2-18 and with the
conclusion shown in Table 2-2.

Figure 2-18. Frequency responses of two-pole filters for nine subjects.

(3) As the relation between the residue signal and glottal flow was unveiled, the
glottal phase characteristics could be traced back from the residue signal. We have studied
the similarities and differences among the nine integrated residue signals by examining their
correlation coefficients. It was found that no similarity exists except for modal voices. This
result suggests that the glottal phase characteristics cannot be described with a general
pattern. In contrast, the abruptness index showed an advantage for characterizing voice
types. The values of the abruptness indexes for the three voice types are, in descending
order, vocal fry, modal, and breathy voices. Such a measure may enable researchers to
classify voice types with considerable convenience. It is worth noting that the meaning of
this measure can be interpreted from various aspects. In the time domain, it is related to the
temporal transition from the maximum glottal closure instant to the glottal opening. In the
frequency domain, it indicates the spectral slope of the glottal source. From the point of view
of the residue signal, it corresponds to the peak factor of the main excitation pulse.

(4) The estimated SNR's for various voice types supported the earlier finding of other
researchers that breathy voices, in general, were accompanied by the largest vocal noise.
Such a noise level was distinctive enough to underscore its role in source modeling. An
intriguing observation with the breathy voices is that the standard deviations of the
corresponding SNR's are as small as those of modal voices. This result suggests that a steady
noise source would be appropriate to model the vocal noise for modal and breathy voices.

As displayed in Figure 2-16, the noise spectra for different phonations were fairly
flat, suggesting that the noise for the integrated residue is white. However, for the purpose
of speech synthesis, the noise source has to be pre-emphasized by a highpass filter before
being applied to an LP synthesizer.

The amplitude modulation of the vocal noise generally resembles the magnitude of
the integrated residue (Figure 2-15). However, the high amplitude modulation near the
glottal closure may also be ascribed to the phase misalignment. Notice that there are two
types of randomness present in the residue signal: one is the epoch variation caused by the
vocal fold closure, and the other is the variation due to the airflow turbulence from the lungs
(Kang and Everett, 1985). The integration … 
adjustment procedure in favour of the airflow, thus increasing the degree of mismatching for
the epoch variation. This claim can be further verified by the following
derivation. Suppose there is a phase offset, $\theta$, between two signals, $s(t)$ and
$s_1(t) = \mathrm{Re}\{s(t)e^{j\theta}\}$; then the difference $n(t)$ is

$$\begin{aligned} n(t) &= s(t) - s_1(t) \\ &= s(t) - \mathrm{Re}\{s(t)e^{j\theta}\} \\ &= 2 s(t) \sin^2\!\left(\frac{\theta}{2}\right) \\ &\approx s(t)\,\frac{\theta^2}{2}, \quad \text{for } |\theta| \ll 1. \end{aligned} \tag{2-30}$$

Clearly from Eq. (2-30), the error $n(t)$ is proportional to the signal, resulting in the similarity
between the amplitude modulation of the extracted noise and the magnitude of the integrated
residue.
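The small-angle behaviour of a phase-offset error can be checked numerically. The sketch below treats the signal sample as real, so that a phase rotation by $\theta$ followed by taking the real part scales the sample by $\cos\theta$; the function name and the chosen offset are hypothetical.

```python
import math


def phase_offset_error(s, theta):
    """Difference between a real sample s and Re{s e^{j theta}} = s cos(theta)."""
    return s - s * math.cos(theta)


theta = 0.1                                    # hypothetical small phase offset (rad)
exact = phase_offset_error(1.0, theta)
half_angle_form = 2 * math.sin(theta / 2) ** 2  # 2 sin^2(theta / 2)
small_angle_form = theta ** 2 / 2               # theta^2 / 2
```

For this offset the half-angle form agrees with the exact difference to rounding error, and the quadratic approximation differs from it by roughly $\theta^4/24$.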

So  far  it  is  undetermined  whether  the  amplitude  modulation  is  an  artifact  of  the 
analysis  method  or  is  a primitive  feature  of  the  glottal  source.  We  will  revisit  this  issue  in 
Chapter  4.  But  one  thing  is  certain  here,  that  is,  the  quality  of  synthetic  speech  is  affected 
by  the  simulation  of  vocal  noise. 


2.8  Conclusion 

In this chapter we have explored the acoustical features within the source-filter
theory. The properties of the glottal source were primarily extracted from the integrated
residue signal, which was obtained by making use of the pitch-synchronous LP analysis with
the aid of the DEGG signal. We demonstrated the analysis methods using sustained vowels,
/i/'s, of three voice types, i.e., modal, vocal fry, and breathy voices. The roles of many
existing acoustic measures were carefully investigated. Although more extensive
investigations are needed in order to establish the statistical significance of model parameters,
the results of our study provided a basic understanding of source variations as well as their
… More important, the capability of extracting the
… properties using LP analysis was substantiated. The
… method in speech analysis suggests that a high-quality LPC synthesizer is achievable. Of
course, this is under the assumption that the properties of the glottal source are faithfully
… features that are usually ignored in many LP synthesizers, i.e., the vocal noise and the glottal …

CHAPTER  3 
SOURCE  MODELING 


Since it was introduced in the late 1960s, the linear predictive coding (LPC)
technique has been extensively used in speech processing and coding (Rabiner and Schafer,
1978). Speech synthesizers considered in the class of LPC coders use a slowly time-varying
all-pole filter to model the composite spectral characteristics of the glottal flow, vocal tract,
and lip radiation. The excitation for this all-pole filter is a spectrally flat signal with
quasi-periodic phases for voiced speech and random phases for unvoiced speech. In this
study, we apply a sixth-order polynomial model to delineate the phase characteristics of
voiced source excitation. Source features extracted by this model are further compressed
through a vector quantization technique. A 32-entry glottal codebook is derived by
quantizing the voiced samples uttered by 20 subjects. On the other hand, a 256-entry
stochastic codebook is generated for unvoiced speech synthesis. However, unlike the glottal
codebook, codewords in the stochastic codebook are simply taken from a Gaussian noise …

3.1  Review  of  Previous  Research 

Over the years, various types of excitation have been proposed to drive the synthesis
filter to produce speech. In the conventional pitch-excited LPC vocoder (Atal and
Hanauer, 1971), the excitation signal is either an impulse train for voiced speech or
random noise for unvoiced speech. The quality of synthesized speech in some applications
… oversimplified excitation functions (Wong, 1980; Kahn and Garst, 1983).

…, such as the Multi-Pulse (MP),
Code-Excited (CE) or their relatives (Atal and Remde, 1982; Schroeder and Atal, 1985;
Singhal and Atal, 1989; Rose and Barnwell, 1990) can result in high-quality synthetic speech
if the synthetic excitation is described sufficiently well by an adequate number of codewords
or pulses. Coders using this type of excitation go beyond spectral analysis and pitch
estimation. Features not representable by predictive filters can be recovered by formulating
the excitation signal. That is, the excitation signal is formed by searching for the best
candidate in a given set of innovative sequences by minimizing the spectrally weighted
difference between the original and the synthesized speech signals.

In fact, the ideal excitation for LP synthesizers is the residue signal obtained by
inverse filtering of the original speech signal. Attempts have been made to encode and
transmit the residue signal in many coding systems (Un and Magill, 1975; Dankberg and
Wong, 1979). But little research effort has been directed to extracting the features of the
residue signal. In 1978 Wong and Markel constructed a prototype excitation pulse by inverse
filtering the differentiated glottal flow of the vowel /a/. Although this excitation pulse was
intentionally designed to reduce the buzziness of synthesized speech, both quality and
naturalness, as expected, were improved due to the preserved glottal characteristics.
However, the excitation pulse presented in their experiment has certain drawbacks. First,
it is feasible only when the fundamental frequency is below 160 Hz. Second, a single
… various speakers and phonations can vary considerably.

The importance of glottal characteristics for speech synthesis was also demonstrated
by Bergstrom and Hedelin (1989). Finding the similarity between the residue and the second
derivative of the glottal pulse, they incorporated the glottal pulse into a CELP coder by
adding an extra codebook. The resultant quality of synthetic speech was reported to be
favored over the quality produced by the primitive CELP coder. Recently, the incorporation
of the residual features by means of excitation codebooks also gained a certain degree of
success in synthesizing natural speech at 2.4 kb/s (Haagen et al., 1992; Zhang and
Chen, 1992).

Other attempts to replace the residue by stylized pulses appear in the work by Sambur
et al. (1978) and by Childers and Wu (1990). Among the tested pulses, the differentiated
electroglottograph (DEGG) signal was found to produce good quality. Such a result
occurred because the DEGG signal reflects the glottal characteristics and has a rather flat
… "divide-and-conquer" strategy to depict the residue. Some divided the spectrum of the
residue into several frequency bands and examined the corresponding spectral
characteristics for each band (Makhoul et al., 1978; Kwon and Goldberg, 1984; Griffin and
Lim, 1988; McCree and Barnwell, 1991). The excitation signal was then formed by
… exhibit a flat spectrum. If there were only two spectrum bands to be specified, the model
was often referred to as the mixed excitation since it resulted in a mixture of low-frequency
pulses and high-frequency noise. If the number of divided bands matched that of pitch
harmonics, this type of excitation became a superposition of sinusoids and was named after
its general properties as either harmonics or sinusoids (Trancoso et al., 1990). On the other
hand, such a "divide-and-conquer" strategy was also considered by researchers for use in the
time domain. Sreenivas (1988) parsed the residue signal into three parts, i.e., high-energy
pulses, a low-energy smooth component, and a random noise component. Each component
was acquired by using a distinctive feature. For instance, the high-energy pulses were found
based on an error minimization scheme similar to MPLP coders. After subtracting the pulses
from the residue, the smooth component was calculated by vector quantization. Likewise,
the noise component was determined by codeword searching as in CELP coders. Such an
approach has proven useful for speech coding in the range of 9.6 kb/s. Sukkar et al. (1989)
decomposed the residue into a set of orthogonal functions called Zinc-functions. They
claimed that the Zinc-function is superior to the Fourier expansion for modeling the residue
in the mean square error sense. However, even though both the frequency and time domain
approaches offered better synthetic quality, none of the above-mentioned models provided
a clue to describe the glottal features parametrically.

From the above discussion, it appears that the quality of synthesized speech can be
improved once we attend to the basic features of the residue signal. Our investigation
showed that the residue was closely related to the glottal volume velocity via the glottal
shaping filter (see Section 2.3). In fact, Kang and Everett (1985) have demonstrated how
to improve the quality of the pitch-excited LPC vocoder through the exploitation of the
amplitude and phase spectra of the residue. It was also reported that high-quality LP
synthesis could be achieved by introducing an extended filter which captured some of the
glottal phase characteristics (Caspers and Atal, 1987; Hedelin, 1988). The improvement due
to the appropriate modeling of the glottal source is more evident when a glottal flow model is
applied to the formant synthesizers (Rosenberg, 1971; Holmes, 1973; Klatt, 1980; Pinto
et al., 1989), but such perceptually important features have not been widely considered in LP
synthesizers. Our primary goal is to design an efficient excitation model to simulate the
residue so that we may achieve high-quality natural-sounding speech production using such …


3.2  Excitation  Source 

In  a manner  similar  to  that  adopted  in  the  traditional  LP  synthesizer,  we  classify  the 
excitation function into two categories, i.e., voiced and unvoiced. Accordingly, two
different  strategies  are  employed  to  analyze  and  process  the  speech  signal. 

3.2.1  Voiced  Segments:  Excitation  Pulse 

In  Section  2.3,  we  have  shown  that  the  phase  characteristics  of  a glottal  flow 
waveform  could  be  retrieved  from  the  residue  signal.  However,  since  the  zero-reference 


level of the glottal flow has been destroyed due to the inverse filtering and integration, source
models that specify the differentiated glottal flow are not suitable for modeling the integrated
residue. We therefore propose a new model to code the integral of the residue. This model
is described by a sixth-order polynomial $f(x) = \sum_{i=0}^{6} c_i x^i$, which is specified within the interval
[0,1] subject to three constraints listed below.

1. $f(0) = -1$. (3-1)

2. $f(1) = f(0)$. (3-2)


where  the  interval  boundaries,  0 and  1 , correspond  to  the  glottal  closure  instants  (GCI).  The 
order of the polynomial is empirically chosen to be six because it sufficiently describes the
integrated  residue  without  causing  rank  deficiency. 

The purpose of the constraints is as follows. The first constraint is used to normalize
the magnitude of the largest negative peak. The second constraint is to ensure the circular
continuity between consecutive periods. It is also equivalent to the following expression:

$$\int_0^1 f'(x)\,dx = 0, \qquad (3-4)$$

which indicates that the d.c. component in the residue signal is eliminated. The third

Because  of  these  constraints,  only  four  degrees  of  freedom  are  available  in  the 

under such constraints, we can introduce Lagrange multipliers and solve a set of equations
as in an optimal control system. Nonetheless, the main purpose of these constraints is not
to limit the dynamics of the polynomial coefficients while carrying out the optimization.
Instead, the constraints are just used to regulate the polynomial waveform. They can also
be satisfied by adjusting a tentative polynomial, which is calculated based on a least square
fit. Here we apply a weighting function to emphasize the polynomial fitness around the GCI
since this region is directly related to the primary excitation pulse. The weighting function
is given by


$$W(x) = \begin{cases} 200x^2 - 40x + 3 & \text{for } 0 \le x \le 0.1 \\ 1 & \text{for } 0.1 < x < 0.8 \\ 25x^2 - 40x + 17 & \text{for } 0.8 \le x \le 1 \end{cases} \qquad (3-5)$$

and is displayed in Figure 3-1. In practice, the weighting function can also reduce the chance
of rank deficiency while we perform the polynomial fit.
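The weighted fit described above can be sketched as follows. This is only an illustration under stated assumptions: the first branch of $W(x)$ is read as $200x^2 - 40x + 3$ (which makes $W$ continuous and smooth at $x = 0.1$), the weight is assumed to multiply the squared error, and the function names `gci_weight` and `fit_tentative_polynomial` are hypothetical.

```python
import numpy as np

def gci_weight(x):
    """Piecewise weighting W(x) on [0, 1], emphasizing the regions near the GCIs."""
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x)
    left = x <= 0.1
    right = x >= 0.8
    # Assumed reading of the first branch of Eq. (3-5).
    w[left] = 200.0 * x[left] ** 2 - 40.0 * x[left] + 3.0
    w[right] = 25.0 * x[right] ** 2 - 40.0 * x[right] + 17.0
    return w

def fit_tentative_polynomial(period, order=6):
    """Weighted least-squares fit of one GCI-to-GCI period mapped onto [0, 1].

    Returns the coefficients in ascending order (c0, c1, ..., c6).
    """
    y = np.asarray(period, dtype=float)
    x = np.linspace(0.0, 1.0, len(y))
    # np.polyfit applies w to the unsquared residual, so sqrt(W)
    # weights the squared error by W.
    descending = np.polyfit(x, y, order, w=np.sqrt(gci_weight(x)))
    return descending[::-1]
```

The tentative coefficients returned here do not yet satisfy the constraints; they are only the starting point for the adjustments that follow.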

Once we obtain the tentative polynomial, the first constraint can be achieved by

$$c_i = -\frac{c_i}{c_0} \quad \text{for } i = 0, 1, 2, 3, 4, 5, 6. \qquad (3-6)$$

The second constraint can be satisfied by seeking a value $v$ close to 1 such that $f(v)$ is $-1$.
Accordingly, the polynomial coefficients are revised as

$$c_i = c_i v^i \quad \text{for } i = 1, 2, 3, 4, 5, 6. \qquad (3-7)$$

level. We can modify the constant $c_0$ to accomplish this requirement:

$$c_0 = -\sum_{i=1}^{6} \frac{c_i}{i+1}. \qquad (3-8)$$


Figure 3-1. Plot of the weighting function, W(x).


Thus,  the  resultant  integral  for  a period  I 


above-mentioned adjustments shall be arranged as Eq. (3-7), then Eqs. (3-8) and (3-6) in
order to prevent any further conflict among the constraints.
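A minimal sketch of the three adjustments, applied in the stated order, is given below. It rests on assumptions that the surviving text does not fully confirm: Eq. (3-7) is taken as $c_i \leftarrow c_i v^i$, Eq. (3-8) as the choice of $c_0$ that makes the polynomial integrate to zero over [0,1], and Eq. (3-6) as a division of every coefficient by $-c_0$. The function name `enforce_constraints` is hypothetical.

```python
import numpy as np

def enforce_constraints(c):
    """Adjust ascending coefficients c0..c6 in the order (3-7), (3-8), (3-6)."""
    c = np.asarray(c, dtype=float).copy()
    powers = np.arange(len(c))

    # (3-7): find a real v near 1 with f(v) = f(0), then rescale the x-axis by v.
    # f(x) - f(0) = x * (c1 + c2 x + ... + c6 x^5), so v is a root of the quintic.
    # A quintic with c6 != 0 always has at least one real root.
    roots = np.roots(c[:0:-1])
    real = roots[np.abs(roots.imag) < 1e-9].real
    v = real[np.argmin(np.abs(real - 1.0))]
    c[1:] *= v ** powers[1:]

    # (3-8), assumed form: shift c0 so that the integral over [0, 1] is zero.
    c[0] = -np.sum(c[1:] / (powers[1:] + 1))

    # (3-6), assumed form: scale all coefficients so that f(0) = c0 = -1.
    return -c / c[0]
```

Because the shift of $c_0$ moves $f(0)$ and $f(1)$ by the same amount, and the final scaling multiplies everything by one constant, the later steps do not undo the earlier ones, which is consistent with the ordering prescribed in the text.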

3.2.1.1  Vector  quantization 

Like many other glottal source models, the polynomial model only provides a rough
description of the glottal phase characteristics. The lack of detailing of the glottal phase may
lead to a degradation in quality of synthetic speech. However, in a study concerning the
influence of glottal flow waveforms on the quality of voiced synthetic speech, Rosenberg
(1971) concluded that only gross source features are required to preserve the quality whereas
temporal and spectral details are less important. Assertions regarding the phase
characteristics were further supported by other researchers (Atal and David, 1979; Hedelin,
1986). Their results lead us to speculate that the glottal excitation acquired by our model
may provide sufficient discriminatory information in order to synthesize good quality
speech. It is noted that vector quantization techniques have demonstrated good performance
in compressing LP features with a relatively low bit rate (Linde et al., 1981; Gray, 1984).
We believe that the glottal phase characteristics portrayed by our source model could be more
concise via an appropriate vector quantizer, at least in terms of perceptual quality.

amplitude sample into one of a set of discrete-amplitude samples suitable for storage and
communication in a digital system. The process is known as scalar quantization, if each
individual sample is quantized independently. When a block of samples, usually defined


Given a $k$-dimensional Euclidean space $R^k$, a vector quantizer is considered a criterion
partitioning $R^k$ into a finite subset $Y$ of $R^k$, where $Y = \{y_i : i = 1, 2, \ldots, N\}$ is the set of
reproduction vectors and $N$ the number of vectors in $Y$. The set $Y$ is called a codebook and
its elements are called codewords or codevectors. In principle, the codeword $y_i$ is chosen
to minimize the average distortion for each quantized cell. The distance between any input
vector and its corresponding codeword is known as the distortion. Once these codewords
are established, any input vector is then assigned to a particular codeword based on minimum
distortion for optimal representation. More specifically, pattern vector $x$ is encoded by
codeword $y_i$ if the distance between those two vectors is less than the distance to any other
codeword, i.e.,

$$d(x, y_i) < d(x, y_j), \quad j \ne i;\ i, j = 1, \ldots, N \qquad (3-10)$$

where  the  function  d denotes  the  distance  measure,  and  N is  the  number  of  codewords.  A 
major  advantage  with  the  vector  quantizer  is  that  it  often  reduces  the  number  of  bits  required 
to  represent  the  input  vector  under  a specific  distortion  measure.  Indeed,  this  advantage  can 
be  formally  proven  through  mathematical  derivations.  According  to  the  Shannon 
rate-distortion  theory,  the  vector  quantizer  always  achieves  higher  data  compression  ratios 
than any coding scheme based on the scalar quantities for a given transmission bit rate.
Because  of  this,  during  the  past  decade,  the  vector  quantization  has  received  much  attention 
as  a data  compression  technique  for  encoding  data  in  information  intensive  fields  such  as 
image  and  speech  signals. 
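The encoding rule of Eq. (3-10) amounts to a nearest-codeword search. The sketch below assumes a squared Euclidean distance for $d$ and uses hypothetical function names; it also shows the number of bits needed to transmit a codeword index.

```python
import numpy as np

def encode(x, codebook):
    """Return the index i of the codeword y_i with minimum distortion d(x, y_i)."""
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance to every codeword (assumed distance measure).
    d = np.sum((codebook - np.asarray(x, dtype=float)) ** 2, axis=1)
    return int(np.argmin(d))

def bits_per_vector(n_codewords):
    """Number of bits needed to index a codebook of N codewords."""
    return int(np.ceil(np.log2(n_codewords)))
```

With a 32-entry codebook, for example, each input vector is represented by a 5-bit index regardless of its dimension.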

A vital step in establishing the vector quantizer is generation of an accurate
codebook. Here the word "accurate" stands for having minimum distortion. The
accomplishment of this step requires a criterion to quantify the Euclidean space and a
distortion measure to define the performance of a quantizer. There are two distortion criteria
commonly adopted for vector quantizers, namely, either minimizing the average

$$E = -\sum_i p_i \log_2(p_i) \qquad (3-11)$$

distortion, the most efficient way to quantize the vector space is to let each quantized cell
(also known as "cluster" in some literature) have the same entropy. Conceptually,
minimizing the average quantization error can be viewed as a scheme performing a
geometric division of the vector space, while maximizing the entropy is a scheme to achieve
a popular division of the vector space. Our philosophy of quantizing the vector space is to
minimize the quantization error but at the same time to maximize the selected frequency of
each codeword. It appears that the sum of intra-cluster distortion serves as a proper criterion
for cluster splitting because this criterion takes both geometric and popular division
properties into account (Tou and Gonzalez, 1974; Nyeck and Tosser-Roussey, 1992). Under

split even though their intra-cluster distortions are low. Hence the codebook space is not wasted
in accommodating unusual pattern vectors of glottal phase signals.

A perfect partition for the pattern space may be quite difficult to accomplish,
although it is theoretically obtainable when the distortion measure is specified and the
probability density function of input vectors is known. Such a difficulty, however, can be
circumvented by making use of long training sequences that approximately represent the
probability density function. Thus, if the vector process is ergodic and stationary, averaging
the distortion for a large amount of training vectors is equivalent to applying the probabilistic
model to the underlying process. Since each vector is mapped into only one particular
codeword, the codewords themselves may be established through clustering techniques. In
fact, the optimal codeword is just the centroid of its associated cluster subject to a selected
distortion measure. This implies that the cluster analysis algorithms in pattern recognition
literature, such as K-means, ISODATA, DYNOC, and some neural-net techniques can be
used to categorize the training vectors

hyperplane partitioning the clusters (Tou and Gonzalez, 1974; Tou, 1979; Pao, 1989).

3.2.1.2 Maximum descent algorithm

In this study we generate a 32-entry codebook using a maximum descent algorithm
(Ma and Chan, 1991). We note that the size of the codebook is just tentative. This number

The maximum descent rule says that the clusters are chosen one at a time attempting
to achieve a maximum reduction of the sum of the distortions. As illustrated in Figure 3-2,
we begin the splitting routine by placing all vectors in a global cluster. After forming the
first two clusters, we compare the reduction functions, $R_i$ and $R_j$, of the two new clusters
and then split the one giving the larger reduction. To generalize the preceding procedures,
let us consider the case of forming $n+1$ clusters based on a set of $n$ clusters. The cluster $S_m$
($m \le n$) is split into two new clusters if $R_m$ is the largest among all the $R_i$'s of the $n$ clusters.
Hence the set of $n+1$ clusters is the one that gives the maximum descent in distortion when
formed from the set of $n$ clusters. The algorithm iterates until the desired number of clusters
is obtained. Finally, the centroids of the clusters are taken as the codewords.

time is significantly reduced since only the $R_i$'s of the two newly formed clusters need to be
computed  while  all  other  clusters  have  been  calculated  in  the  previous  iteration,  and  (2) 
empty  clusters  are  prevented  since  it  is  impossible  for  a single-member  cluster  to  be  chosen 
for  splitting. 
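The maximum descent rule can be sketched as below. This is a simplified illustration, not the implementation used in this study: pattern vectors are plain coefficient vectors, the distortion is assumed to be the squared Euclidean distance, each trial split uses two initial centroids chosen as extreme points followed by a few reassignment iterations, and all function names are hypothetical.

```python
import numpy as np

def sum_of_distortions(X):
    """D(S): total squared distance of the cluster members to their centroid."""
    return float(np.sum((X - X.mean(axis=0)) ** 2))

def binary_split(X, n_iter=50):
    """Trial split of one cluster; returns (reduction R, sub-cluster 1, sub-cluster 2)."""
    if len(X) < 2:
        return 0.0, X, X[:0]                      # single member: never chosen
    z0 = X.mean(axis=0)
    y1 = X[np.argmax(np.sum((X - z0) ** 2, axis=1))]   # farthest from the mean
    y2 = X[np.argmax(np.sum((X - y1) ** 2, axis=1))]   # farthest from y1
    for _ in range(n_iter):
        to_first = np.sum((X - y1) ** 2, axis=1) < np.sum((X - y2) ** 2, axis=1)
        if to_first.all() or not to_first.any():
            return 0.0, X, X[:0]                  # degenerate split: never chosen
        new1, new2 = X[to_first].mean(axis=0), X[~to_first].mean(axis=0)
        if np.allclose(new1, y1) and np.allclose(new2, y2):
            break
        y1, y2 = new1, new2
    a, b = X[to_first], X[~to_first]
    reduction = sum_of_distortions(X) - (sum_of_distortions(a) + sum_of_distortions(b))
    return reduction, a, b

def maximum_descent_codebook(X, n_codewords):
    """Split the cluster with the largest reduction until n_codewords clusters exist."""
    X = np.asarray(X, dtype=float)
    clusters = [X]
    splits = [binary_split(X)]                    # cached trial split per cluster
    while len(clusters) < n_codewords:
        m = int(np.argmax([s[0] for s in splits]))
        reduction, a, b = splits[m]
        if reduction <= 0.0:
            break                                 # nothing left worth splitting
        clusters[m:m + 1] = [a, b]
        # Only the two new clusters need a fresh trial split.
        splits[m:m + 1] = [binary_split(a), binary_split(b)]
    return np.array([c.mean(axis=0) for c in clusters])
```

Caching the trial splits reflects the first advantage noted above, and returning a zero reduction for single-member or degenerate clusters reflects the second.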

3.2.1.3 Cluster splitting

Since  each  codeword  represents  the  centroid  of  a specific  cluster,  the  size  of  the 
codebook  equals  the  number  of  clusters  partitioned  in  the  pattern  space.  We  adopt  a splitting 
technique  to  carry  out  the  cluster  partition.  This  technique,  in  general,  is  not  guaranteed  to 


Figure 3-2. Cluster splitting using the Maximum Descent method.
$D(S_i)$ = sum of distortions.
$R(S_i)$ = reduction of distortion due to cluster splitting.


with a binary


scheme (Buzo et al., 1980). Steps for splitting each given cluster are summarized as follows:


Step 5. Assign the initial centroids by using the extreme-point approach,
which will be discussed in Section 3.2.1.3.1.

Step 6. Partition the cluster vector on the basis of minimum distortion, i.e.,
if $d(x_j, y_1) < d(x_j, y_2)$, $x_j \in S_{i1}$;
otherwise, $x_j \in S_{i2}$.


Step  7.  Obtain  the  new  ccntroi 


signed  to  S;  at 
iber  of  iterador 


where  N,  and  N2  are  the  number 
respectively.  The  superscript  / der 
Step 8. Calculate the reduction of distortion due to splitting:

$$R(S_i) = D(S_i) - [D(S_{i1}) + D(S_{i2})].$$


The  outcome  of  the  cluster  analysis,  in  general,  will  be  affected  by  three  face 
‘d  by  properly  selecting  the  training  vectors.  At  this  stage,  we  can  assume  that 


: exemplary  both  in  comple 


nd  equilibrium.  Thu 


concerned only with the initial centroids and distortion measure.

3.2.1.3.1 Initialization of centroid

Several methods for determining the initial codewords exist. We may simply choose
the first two training vectors as our initial centroids, similar to the manner used in the
K-means method. However, simply choosing the first two vectors will not produce an
accurate result if these two vectors are close to each other. Intuitively, one would like these
two vectors to be well-separated. We, therefore, assign the two initial centroids using the
following approach. Let $\{x_1, x_2, x_3, \ldots, x_N\}$ be the $N$ sample vectors. The mean vector is given
by

$$z_0 = \frac{1}{N} \sum_{i=1}^{N} x_i. \qquad (3-12)$$

Using $z_0$ as a reference vector, we first find a vector that is farthest from $z_0$. That is,

$$d(x_m, z_0) > d(x_i, z_0) \quad \text{for } i \ne m;\ i, m = 1, \ldots, N. \qquad (3-13)$$

This vector $x_m$ is selected as one of the extreme vectors. The other is determined by searching
for the vector that is farthest from $x_m$.

3.2.1.3.2 Distortion measure

As mentioned earlier, the feature space consists of the polynomial coefficients. We
adopted the Euclidean distance as the distortion measure, which is defined as

$$d_{ij} = \int_0^1 [f_i(x) - f_j(x)]^2\,dx, \qquad (3-14)$$

where $d_{ij}$ is the resulting distortion of two arbitrary polynomials, $f_i(x)$ and $f_j(x)$. The centroid,
$\bar{f}_k(x)$, of a cluster, $S_k$, is chosen as

$$\bar{f}_k(x) = \frac{1}{n_k} \sum_{f_i \in S_k} f_i(x), \qquad (3-15)$$

where $n_k$ is the number of vectors inside $S_k$. Thus, the sum of distortions $D(S_k)$ for the cluster
$S_k$ is given by

$$D(S_k) = \sum_{f_i \in S_k} \int_0^1 [\bar{f}_k(x) - f_i(x)]^2\,dx. \qquad (3-16)$$

Let $P_c$ denote the vector of the coefficients of the polynomial $(\bar{f}_k(x) - f_i(x))$, in descending order.
The polynomial multiplication $(\bar{f}_k(x) - f_i(x))^2$ is equivalent to convolving $P_c$ with itself,
i.e., $P_{cc} = P_c * P_c$, where $*$ denotes the convolution operator and $P_{cc}$ is the coefficient
sequence of the resulting polynomial. After solving the integral, Eq. (3-16) becomes

$$D(S_k) = \sum_{f_i \in S_k} \sum_{j=1}^{n} \frac{P_{cc}(j)}{n - j + 1}, \qquad (3-17)$$

where $n$ is the number of coefficients of $P_{cc}$. In our case, $n = 13$.
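The convolution shortcut can be illustrated for a single pair of polynomials. The sketch assumes the distortion is the integral of the squared difference over [0,1] and that coefficients are stored in ascending order ($c_0$ first); the function name `poly_distortion` is hypothetical.

```python
import numpy as np

def poly_distortion(ci, cj):
    """Integral over [0, 1] of (f_i(x) - f_j(x))^2, computed by convolution."""
    # Difference polynomial, reordered to descending powers.
    pc = (np.asarray(ci, dtype=float) - np.asarray(cj, dtype=float))[::-1]
    # Squaring a polynomial is the same as convolving its coefficients with themselves.
    pcc = np.convolve(pc, pc)
    n = len(pcc)                      # 13 for sixth-order polynomials
    j = np.arange(1, n + 1)
    # pcc[j-1] multiplies x^(n-j); its integral over [0, 1] is 1 / (n - j + 1).
    return float(np.sum(pcc / (n - j + 1)))
```

Summing this quantity over all members of a cluster, with the centroid as the second argument, gives the sum of distortions for that cluster without any numerical integration.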

3.2.1.4 Codebook training

In order to reflect the source variation caused by factors such as stress and intonation,
we use sentences instead of sustained vowels for training the glottal codebook. The selected
sentences are: (1) “We were away a year ago.” spoken by 16 subjects, and (2) “Early one

cases, the numbers of both male and female subjects are equal. The data base is shown in
Table 3-1. The resulting codebook is given in Table 3-2. The inclusion of nasals, as in the

attributing zero (anti-formant) characteristics to the source model. Although the set of
training samples does not consist of all possible voiced sounds, source properties are still
considered representative since the supraglottal loading effects are removed by the inverse
filter and source characteristics are presumably the only remaining ingredients.


Table 3-1. Database for codebook training.


3.2.2 Unvoiced/Silence Segments: White Noise

For  simplicity,  we  treat  silence  as  unvoiced  speech  since  the  power  level  of  the 
silence  segments  is  so  low  that  any  modeled  errors  can  be  attributed  to  background  noise. 
Similar  to  the  idea  adopted  in  voiced  excitation,  a stochastic  codebook  is  used  as  the 
excitation  source  for  unvoiced  speech.  This  implies  that  the  residue  is  simulated  using  a 
finite  number  of  innovation  sequences  subject  to  a given  fidelity  criterion.  The  use  of  such 
innovation  sequences  is  motivated  by  the  CELP  coders,  of  which  the  stochastic  codebook 
has  been  known  to  produce  better  unvoiced  speech  than  voiced  speech  for  low-bit  coding 
(Schultheiß and Lacroix, 1989). But in contrast to the fundamental structure of the CELP

unnecessary  in  unvoiced  speech. 

Basically,  the  size  of  the  codebook  is  determined  by  three  factors,  namely,  the 


Table 3-2. Content of glottal codebook.


C2  C1  C0
-0.5999  0.0663  -0.0010


-0.0596

-0.4829


Note: Ci denotes the ith coefficient of the polynomial.


appropriate criterion for characterizing performance, we empirically code the residue for a
5 msec duration (50 samples at 10 kHz) by the use of 256 codewords. The type of codebook
population is not a crucial factor from a perceptual point of view; experiments with
Gaussian, sparse and ternary-value (-1, 0, +1) codebooks have been reported to produce
similar synthetic quality (Trancoso et al., 1990). However, since the probability density

generator to establish the codewords. For each codeword, special formulations of its content
are only for the purpose of reducing the computational effort, which is necessitated by the
filtering process in codeword searching (Kleijn et al., 1990; Galand et al., 1992). This kind

decomposition (SVD), frequency domain and autocorrelation approaches (Trancoso and
Atal, 1990). In our experiment, the autocorrelation approach is adopted to facilitate the
computation. Some relevant details will be given in Chapter 4.

a Gaussian noise generator, but we employ three schemes to establish the codebook:
Scheme 1 (64 entries): Each codeword contains 16 non-zero samples.
The positions of non-zero samples exhibit a uniform distribution
from 1 to 50.

Scheme 2 (64 entries): The conditions are the same as in Scheme 1 except
that 32 out of 50 samples are non-zero.

Scheme 3 (128 entries): Every sample is taken from a Gaussian noise
generator.
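The three schemes can be sketched as follows. The sketch assumes that the non-zero samples in Schemes 1 and 2 are also drawn from a unit-variance Gaussian generator and that their positions are chosen without repetition; the function name and the fixed seed are hypothetical.

```python
import numpy as np

def make_stochastic_codebook(seed=0, length=50):
    """Build a 256-entry stochastic codebook of 50-sample codewords."""
    rng = np.random.default_rng(seed)

    def sparse(n_entries, n_nonzero):
        cb = np.zeros((n_entries, length))
        for row in cb:
            # Positions are uniformly distributed over the 50 sample slots.
            pos = rng.choice(length, size=n_nonzero, replace=False)
            row[pos] = rng.standard_normal(n_nonzero)
        return cb

    scheme1 = sparse(64, 16)                          # 16 non-zero samples per codeword
    scheme2 = sparse(64, 32)                          # 32 non-zero samples per codeword
    scheme3 = rng.standard_normal((128, length))      # fully Gaussian codewords
    return np.vstack([scheme1, scheme2, scheme3])
```

The stacked result has 64 + 64 + 128 = 256 codewords, matching the codebook size stated earlier.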

The sparse codewords in Schemes 1 and 2 are used to enhance the spiky nature of
the residue so that the stochastic codebook can also be applied to synthesize the mixed sounds
as well as plosives. This concept is very similar to that proposed by Kang and Everett (1985),
who introduce a few spaced spikes into the unvoiced excitation in order to obtain satisfactory
plosive sounds.


CHAPTER  4 

SPEECH  ANALYSIS/SYNTHESIS/EVALUATION 


In Chapter 2, we focused on how to interpret the acoustic features of speech signals
within  the  linear  source-filter  theory.  The  power  of  LP  techniques  for  performing  the  feature 
extraction  suggests  that  a high-quality  LP  synthesizer  could  be  achieved  if  these  features 
were  appropriately  modeled  and  accurately  estimated.  Hence,  in  Chapter  3 we  discussed 
source  modeling.  The  residue,  known  as  the  ideal  source  excitation,  was  simulated  either 
by  glottal  impulses  for  voiced  speech  or  innovation  sequences  for  unvoiced  speech.  Both 
types  of  excitations  were  further  formulated  into  two  specific  codebooks.  The  reader  can 
envisage  Chapter  2 as  an  anatomical  study  of  the  speech  signals  and  Chapter  3 as  an 
examination  of  the  glottal  source.  The  information  obtained  from  these  chapters  can  now 
assist  us  in  deriving  a synthesis  model  capable  of  producing  high-quality  natural-sounding 

In  this  chapter,  we  present  a new  model  which  includes  many  improved  features  such 
as  the  interpolation  of  LP  coefficients,  turbulent  noise  and  source-tract  interaction.  The 
parameters in this model are obtained by the analysis-by-synthesis procedure, in which the
analysis  denotes  the  process  of  estimating  the  parameters  that  characterize  the  speech  signal 
and  the  synthesis  denotes  the  process  of  replicating  the  speech  signal  by  controlling  and 
updating  these  parameters  under  the  supervision  of  the  speech  production  model.  We  will 
describe  our  methods  and  strategies  in  dealing  with  these  issues.  While  the  performance  of 
this  model  is  evaluated  by  judging  its  ability  to  produce  natural  speech,  we  also  discuss  the 
results  of  informal  listening  tests. 


Analysis  Scheme 


The  speech  production  model  employed  in  this  study  is  depicted  in  Figure  4-1. 
Except  for  the  excitation  source,  the  model  retains  the  basic  structure  of  the  pitch-excited 
LP synthesizer. In addition to an all-pole filter, other parameters required by this model
comprise  a voicing  decision,  voiced/unvoiced  gains,  codeword  indexes  and  Glottal  Closure 
Instants  (GCI's)  for  the  voiced  speech. 

In  general,  a pitch  synchronous  approach  is  preferred  for  speech  processing  not  only 
because it provides better formant trajectories (Krishnamurthy and Childers, 1986) but also
because it facilitates the synthesis work. To implement such an approach, we need to locate
every GCI accurately before computing the LP coefficients. The difficulties of identifying
GCI’s  complicate  the  feasibility  and  reliability  of  the  implementation,  making  the  pitch 
synchronous  analysis  practically  unattractive.  Thus,  we  decide  to  use  a frame-based  method 
to  compute  the  LP  coefficients,  but  carry  out  the  speech  synthesis  pitch  synchronously  after 
determining  the  pitch  period. 

Since the speech signal is sampled at 10 kHz, a linear predictor of 13th order is chosen
to account for the spectral characteristics of the glottal source (3 poles) and vocal tract (10
poles). The filter coefficients along with the residue are derived concurrently using an
orthogonal covariance method (Ning and Whiting, 1990), performed once per frame
sequentially  throughout  the  input  speech.  The  frame  size  is  25  ms  with  an  overlap  of  5 ms 
between  any  two  consecutive  frames.  For  each  frame,  the  LP  gain  is  normalized  by  adjusting 
the  power  of  the  residue  to  that  of  the  speech  signal.  The  residue  in  the  overlapped  area  is 
obtained  by  weighting  the  forward  and  backward  overlapping  sequences  with  decreasing 
and  increasing  trapezoidal  windows  respectively  and  adding  them  together: 

ffl-SJifLru+iriT*®-  '-«•* » 


(4-1) 


Figure 4-1. Proposed speech production model.


where $e_f(i)$ and $e_b(i)$ denote the forward and backward residue signals, respectively, and $e(n)$ is the
resulting residue signal for the overlapped area of length $M$.
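The blending of the two residue sequences in the overlapped area can be sketched as a linear cross-fade. The exact ramp endpoints are an assumption here, since only the decreasing and increasing character of the windows is stated; the function name `blend_overlap` is hypothetical.

```python
import numpy as np

def blend_overlap(e_forward, e_backward):
    """Cross-fade the forward and backward residue over the overlapped area."""
    e_f = np.asarray(e_forward, dtype=float)
    e_b = np.asarray(e_backward, dtype=float)
    m = len(e_f)
    # Assumed linear ramp, strictly between 0 and 1 over the M overlapped samples.
    rising = np.arange(1, m + 1) / (m + 1.0)
    # Decreasing weight on the forward sequence, increasing on the backward one.
    return (1.0 - rising) * e_f + rising * e_b
```

Because the two weights sum to one at every sample, a residue that is identical in both frames passes through unchanged.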

4.1.1  Orthogonal  Covariance  Method 

Consider  a digital  signal  with  the  following  sequence,  (si,  J2. sp 

The  linear  prediction  of  the  current  sample  is  described  as  a linearly  weighted  summation 
of past samples, i.e.,

*»"  (4-2) 


where the $a$'s are the coefficients of the LP predictor with order $m$, and the $e$'s are the
prediction  errors.  Expressing  the  equations  above  in  a matrix  form,  we  have 


derivations. We define $S_k$ as the $k$th column vector of the matrix $S$, $A$ as the vector of
the LP coefficients, and $E$ as the vector of prediction error. Thus, Eq. (4-3) becomes

$$[S_1\ S_2\ S_3\ \cdots\ S_m]A = S_{m+1} - E. \qquad (4-4)$$

multiplying  the  pseudo  inverse  of  S on  both  sides  of  Eq.  (4-3).  It  may  be  shown  that  the 
obtained  result  is  the  same  as  that  derived  by  a covariance  method,  because  the  error  is 
minimized  over  a specified  interval. 


It  can  also  be  shown  that  a certain  degree  of  efficiency  could  be  gained  by 
reformulating the foregoing computation as follows. Suppose we now decompose the vector
$S_{k+1}$ into $k$ orthogonal vectors $V_i$'s using the Gram-Schmidt method. The set of the
orthogonal  vectors  is 


Vt+,  = St+,  - (4-5) 


and  the  superscript  t denotes  the  transpose  operator.  Arranging  the  orthogonal  expansion 


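$$\begin{bmatrix} S_1^t \\ S_2^t \\ \vdots \\ S_k^t \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ c_2^1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ c_k^1 & c_k^2 & \cdots & 1 \end{bmatrix} \begin{bmatrix} V_1^t \\ V_2^t \\ \vdots \\ V_k^t \end{bmatrix} \qquad (4-7)$$

Through several algebraic manipulations, the row vector $S_{p+1}^t$ in Eq. (4-7) can be shown as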


where $C$ is the matrix containing the $c$'s on the right hand side of Eq. (4-7), $[C]_p$ is an
upper-left submatrix of the matrix $C$ with a rank $p$, and $C_{p+1}$ is the $(p+1)$th row vector of
the matrix $C$ with only the first $p$ elements included. Compared to Eq. (4-3), it is found
immediately that the coefficient vector $A$ is equivalent to the form:

$$A^t = C_{p+1}[C]_p^{-1}, \qquad (4-9)$$

and the vector $V_{p+1}$ is just the estimated error (or the residue) of the $p$th order LP filter.
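A minimal sketch of the orthogonal expansion is given below. It assumes that each column $S_k$ is expanded as $V_k$ plus a weighted sum of the earlier orthogonal vectors, so that the weights form a unit lower-triangular matrix $C$; classical Gram-Schmidt is used for clarity, and the function name is hypothetical. It is not the routine of Ning and Whiting (1990).

```python
import numpy as np

def orthogonal_covariance_lp(s, p):
    """Return (A, residue) for a p-th order predictor over one analysis interval."""
    s = np.asarray(s, dtype=float)
    rows = len(s) - p
    # Column k holds the signal shifted by k samples; column p is the target S_{p+1}.
    S = np.column_stack([s[k:k + rows] for k in range(p + 1)])
    V = np.zeros_like(S)
    C = np.eye(p + 1)
    for k in range(p + 1):
        V[:, k] = S[:, k]
        for i in range(k):
            # Projection weight of S_k on the earlier orthogonal vector V_i.
            C[k, i] = (V[:, i] @ S[:, k]) / (V[:, i] @ V[:, i])
            V[:, k] -= C[k, i] * V[:, i]
    # A^t = C_{p+1} [C]_p^{-1}; the triangular system is solved rather than inverted.
    a = np.linalg.solve(C[:p, :p].T, C[p, :p])
    return a, V[:, p]
```

The last orthogonal vector returned here plays the role of the residue, and the coefficients agree with an ordinary least-squares solution of the covariance equations over the same interval.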

A major  advantage  of  using  the  orthogonal  expansion  is  that  the  matrix  inverse  of 
C can  be  achieved  by  a back-substitution  procedure  (Ning  and  Whiting,  1990).  Another 

argument to determine the order in many methods. In speech processing, the importance of
selecting a correct order can be explained in terms of formant characteristics. A filter with
a lower order tends to disregard insignificant formants or to merge two adjacent ones,
whereas a higher order filter raises the possibility of producing spurious formants. The
resulting incorrect formants may lead to perceivable errors in both cases. Thus, a
variable-order predictor is always preferred, for it adapts to the spectral variation of running
speech. Apart from this reason, such a model can also reduce the transmission bandwidth
if  lower  orders  are  frequently  chosen. 

There are two widely accepted order estimators, namely, the Akaike information
criterion (AIC) (Akaike, 1974) and minimum description length (MDL) criterion (Schwarz,
1978), which can be obtained prior to estimating the LP coefficients. Two other methods
proposed in the recent literature are the Predictive Least Squares (PLS) (Wax, 1988), and
the Iterative Algorithm of Singular Value Decomposition (IASVD) (Konstantinides and
Yao, 1988). A comparison of the performance of the four methods indicated that the IASVD
had the highest success rate in order selection, followed by MDL, AIC and PLS
(Konstantinides, 1991). If we take into account computational efficiency, the MDL turns
out to be a proper choice to work with the orthogonal covariance method. Eventually, the
selected order is the one that minimizes the MDL function given by

$$\mathrm{MDL}(i) = N \times \ln(V_i^t V_i) + i \times \ln(N). \qquad (4-10)$$


After we determine the optimal order $p$, the LP coefficients are then derived from Eq. (4-9).
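Order selection by the MDL function can be sketched as below. For simplicity the residue energy of each candidate order is obtained here by an ordinary least-squares fit over a common set of target samples, which is an assumption and not the orthogonal expansion itself; the function name is hypothetical.

```python
import numpy as np

def select_order_mdl(s, max_order):
    """Return (selected order, MDL value per order 0..max_order)."""
    s = np.asarray(s, dtype=float)
    target = s[max_order:]
    n = len(target)
    mdl = []
    for i in range(max_order + 1):
        if i == 0:
            residue = target
        else:
            # Column k-1 holds the samples delayed by k with respect to the target.
            past = np.column_stack([s[max_order - k:len(s) - k] for k in range(1, i + 1)])
            coeffs = np.linalg.lstsq(past, target, rcond=None)[0]
            residue = target - past @ coeffs
        # N ln(residue energy) + i ln(N)
        mdl.append(n * np.log(residue @ residue) + i * np.log(n))
    return int(np.argmin(mdl)), np.array(mdl)
```

The first term falls as the order grows while the second term penalizes each added coefficient, so the minimum marks the order beyond which extra poles no longer pay for themselves.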
4.1.2 V/U/S Classification

Because there are two types of excitation functions in the proposed model, the first
step toward speech analysis is a voicing decision. The basic principles of our method are
rather simple. If the energy of the underlying signal is below a specified value, the signal
is classified as silence (Campbell and Thomas, 1986; Childers et al., 1989a). Otherwise, we
examine its spectral tilt by calculating the first reflection coefficient. The signal is attributed
to voiced speech if the first reflection coefficient is larger than 0.3. Unvoiced speech is the
result when the previous two tests have failed.
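The decision rule can be sketched as follows. Two points are assumptions: the first reflection coefficient is computed as the normalized autocorrelation at lag one (sign conventions differ between texts), and the energy threshold is a free parameter because its value is not given here. The function name is hypothetical.

```python
import numpy as np

def classify_frame(frame, energy_threshold, k1_threshold=0.3):
    """Label a frame as 'silence', 'voiced' or 'unvoiced'."""
    x = np.asarray(frame, dtype=float)
    energy = np.mean(x ** 2)
    if energy < energy_threshold:
        return "silence"
    # First reflection coefficient, taken here as r(1) / r(0) (assumed convention).
    k1 = np.dot(x[:-1], x[1:]) / np.dot(x, x)
    return "voiced" if k1 > k1_threshold else "unvoiced"
```

A strongly lowpass frame gives a coefficient close to one and is labeled voiced, whereas a frame dominated by high frequencies gives a small or negative value and is labeled unvoiced.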

Unlike other algorithms the correct rate of classification is not strictly required

misclassification between unvoiced and silence is not critical since both share the stochastic
codebook and the quantization error in the silence can always be ignored. Also, the speech
signal with a median spectral tilt (e.g., the first reflection coefficient is around 0.3) often

of both types of excitations. Therefore, either voiced or

acceptable for synthesizing such a speech signal.

of the GCI is essential for codeword searching and speech
synthesis since both are performed on a pitch period-by-pitch period basis. Procedures of
the GCI identification algorithm can be summarized in two steps: (1) pitch estimation, and
(2) peak picking. That is, we determine the location of glottal closure after estimating the

It has long been noted that the sharp peaks in the residue signal generally coincide
with the GCI for a wide variety of voiced sounds. Choosing the largest peak of the residue
signal for many voic

for determining the GCI (Atal and Hanauer, 1971; Ananthapadmanabha and Yegnanarayana, 1979). Voices that are
not rich in harmonic structure or that lack distinctive glottal closure may fail to have large
peaks in the residue. Furthermore, the true peaks may be obscured by other spurious peaks
due to background noise and modelling errors. Perhaps the easiest way to circumvent this
drawback is to apply a lowpass filter to reduce the influences of the spurious peaks.
However, too much smoothing definitely decreases the sharpness of the real peak, so we can
only eliminate the influence of noisy components to such an extent that the true peaks are
not smeared out. To avoid the phase shift of the peaks, we perform the lowpass process by
a zero-phase filter, i.e., by first passing the residue signal forward then running it back
through the same filter (Oppenheim and Willsky, 1983). The Z-domain representation of

<«> 
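The forward-backward idea can be sketched with any causal lowpass filter. The one-pole filter and its coefficient below are assumptions chosen only for illustration, since the actual filter is not recoverable here; the function names are hypothetical.

```python
import numpy as np

def one_pole_lowpass(x, alpha):
    """Causal one-pole lowpass: y[n] = (1 - alpha) x[n] + alpha y[n-1]."""
    y = np.empty(len(x))
    state = 0.0
    for n, sample in enumerate(x):
        state = (1.0 - alpha) * sample + alpha * state
        y[n] = state
    return y

def zero_phase_lowpass(x, alpha=0.5):
    """Filter forward, then run the result backward through the same filter."""
    x = np.asarray(x, dtype=float)
    forward = one_pole_lowpass(x, alpha)
    # Reversing, filtering and reversing again cancels the phase of the first pass.
    return one_pole_lowpass(forward[::-1], alpha)[::-1]
```

A single negative spike comes out smeared but still centered on its original sample, which is the property needed for locating the GCI.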

Once the residue signal is lowpass filtered, a segment of 512 samples centered at the
current frame is extracted using a Hanning window. This windowed segment $s(n)$ is then
transformed to a sequence $P_s(n)$ similar to the cepstrum by

$$P_s(n) = \mathrm{IFFT}(|\mathrm{FFT}[s(n)]|) \qquad (4-2)$$

respectively, and $|\cdot|$ denotes the magnitude.

Like the pitch estimation procedure outlined in the cepstrum method, we choose $m$
as the pitch period if

$$P_s(m) > P_s(n), \quad n = 25, 26, \ldots, 256. \qquad (4-3)$$


The  value,  m,cou!d  be  a multiple  of  the  real  pitch  period.  1 


check  is  given  as  follows.  We  first  look  for  the  position  1 of  the  largest  value  within  the  range 

[25.m-25).  i.c„ 

PAD  > PM)  for  / * n;  l,n  = 25 m - 25.  (4-4) 

If the following condition holds

PAD  > -IP Am).  (4-5) 

then l is adopted as the new pitch period. Otherwise, the pitch period remains m.
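The pitch-multiple check described above can be sketched as follows. The search limits 25 and 256 come from the text; the threshold factor in Eq. (4-5) is not legible in this copy, so the parameter `ratio` and its default of 0.5 are hypothetical placeholders, as are the function and argument names.

```python
def pick_pitch_period(P, lo=25, hi=256, ratio=0.5):
    """Pick a pitch period from a cepstrum-like sequence P (len(P) > hi).

    `ratio` stands in for the unreadable factor of Eq. (4-5); it is an
    assumption, not a value taken from the source.
    """
    # Eq. (4-3): m is the position of the largest value in [lo, hi].
    m = max(range(lo, hi + 1), key=lambda n: P[n])
    # Eq. (4-4): largest value in [lo, m - lo], if that range is not empty.
    if m - lo >= lo:
        l = max(range(lo, m - lo + 1), key=lambda n: P[n])
        # Eq. (4-5): accept the shorter candidate if its peak is strong enough.
        if P[l] > ratio * P[m]:
            return l
    return m
```

A strong secondary peak at half the lag of the global maximum is then taken as the true period, which guards against period doubling.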

After finding the pitch period, we search for the largest negative peak in the smoothed residue. Because the peak has been smeared by the zero-phase lowpass filter, we enhance the accuracy of peak picking by approximating the curve on both sides of the negative peak by two straight lines ranging from the peak value to one-third of this value (highlighted by the circled area in Figure 4-2(a)). The intersection of the two lines is chosen to be the first GCI. A small interval of samples (~4.5 ms) around the first GCI is used as a template (shown in a dashed box) to discriminate other peaks within the same frame. Peaks located approximately one pitch period before or after this GCI are examined by computing the correlation between the template and the waveforms around the peaks. Positions that lead to the largest correlations are then selected as the other GCIs. This procedure continues until the searching range extends beyond the current frame by 50 samples.
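The line-intersection refinement can be sketched as below. The least-squares fit, the exclusion of the peak sample itself from both fits, and all function names are assumptions made for illustration; the text only states that two straight lines spanning the peak value to one-third of that value are intersected.

```python
def fit_line(points):
    """Least-squares line y = slope * x + intercept through (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n


def refine_negative_peak(r):
    """Return a fractional position for the largest negative peak of r."""
    p = min(range(len(r)), key=lambda n: r[n])
    limit = r[p] / 3.0  # one-third of the (negative) peak value
    left, n = [], p - 1
    while n >= 0 and r[n] <= limit:
        left.append((n, r[n]))
        n -= 1
    right, n = [], p + 1
    while n < len(r) and r[n] <= limit:
        right.append((n, r[n]))
        n += 1
    if len(left) < 2 or len(right) < 2:
        return float(p)  # not enough points to fit; keep the sample index
    (a1, b1), (a2, b2) = fit_line(left), fit_line(right)
    if a1 == a2:
        return float(p)  # parallel lines never intersect
    return (b2 - b1) / (a1 - a2)
```

For a V-shaped dip whose true apex falls between two samples, the intersection recovers the fractional position that simple minimum picking would round away.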

The overall computation above costs two FFTs and several comparisons. An economical approach for performing the whole process is to decimate the signal $s(n)$ by a factor of 2 and then to interpolate $P_s(n)$ to counteract the decimation. Because of the lowpass filtering, the foregoing decimation can be carried out by choosing every other sample of $P_s(n)$ without causing serious aliasing.

A complete  example  of  the  GCI  identification  is  illustrated  in  Figure  4-2(b).  In  this 


example,  it  appears  that  the  GCI  < 




Figure 4-2. Illustration of GCI identification: (a) lowpass filtered residue signal,
(b) cross correlation for this residue signal with a template shown in the


jfGCIi 


are  not  always  sharp  and  distinctive.  The  tedious  work  with  regard  to  the  sm 


the current frame is redundant but necessary, because the extra information can be used to prevent erroneous GCIs at frame boundaries.

4.1.4 Codeword Searching

Depending on the voicing conditions, there are two different codebooks prepared to reconstruct the synthetic excitation. Although the basic idea of codeword searching for these two codebooks is the same, i.e., selecting an optimum codeword that achieves a minimum error subject to a distance metric, the individual implementations are somewhat different due


The searching process for the optimal glottal codeword requires that the integrated residue and the polynomial waveform be of the same length. We assume the maximum allowable length for one pitch period to be 25.6 ms. Thus, if we encode every polynomial waveform with this maximum length, the integrated residue of one pitch period can always be interpolated to the maximum length using the FFT method. Taking advantage of the symmetry of the Fourier transformation, we compute the correlation coefficient, $\eta_m(i)$, between the ith polynomial waveform, $g_i(n)$, and the integrated residue of the mth period, $d_m(n)$, by


plays  a role  in  helping  to  reduce  potential  errors.  Similarly,  the  acqui: 


(4-6) 


where $D_m(k)$ is the FFT sequence of the interpolated $d_m(n)$, $G_i(k)$ is the FFT sequence of $g_i(n)$, $\lceil\cdot\rceil$ denotes the ceiling function, and $[\cdot]^*$ denotes the complex conjugate. It is noted that the mean values of $g_i(n)$ and $d_m(n)$ are zero and, therefore, play no role in computing the correlation coefficient. We reflect this by skipping the d.c. term during the multiplication of the two FFT sequences. The second equality in Eq. (4-6) is due to the fact that the interpolated FFT sequence $D_m(k)$ is zero when $k \ge \lceil l/2 \rceil$. The spectral weighting filter, which is commonly used in the CELP coder, does not participate in the equation above. This is because our distance measure is applied to the integrated residue, which emphasizes only the glottal phase characteristics in the low-frequency region.

Since the glottal waveform varies relatively slowly compared to the changes of the vocal tract transfer function, one codeword index is found to be enough to describe the glottal excitation for each voiced frame. We further define the cumulative similarity function, $H(i)$, as the sum of $\eta_m(i)$ along one frame,

$$H(i) = \sum_{m} \eta_m(i) \tag{4-7}$$
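A minimal sketch of the frequency-domain correlation and the cumulative similarity follows. Eq. (4-6) is not legible in this copy, so the normalization below is the standard correlation coefficient evaluated through Parseval's relation, summed over all non-d.c. bins rather than exploiting the half-spectrum symmetry; the naive DFT and every name here are assumptions, and all sequences are taken to be already interpolated to a common length.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stands in for an FFT)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def corr_coeff_freq(D, G):
    """Correlation coefficient from two spectra, skipping the d.c. bin k = 0."""
    cross = sum((D[k] * G[k].conjugate()).real for k in range(1, len(D)))
    energy_d = sum(abs(D[k]) ** 2 for k in range(1, len(D)))
    energy_g = sum(abs(G[k]) ** 2 for k in range(1, len(G)))
    return cross / math.sqrt(energy_d * energy_g)


def best_codeword(residue_periods, codebook):
    """Sum the per-period coefficients over a frame and pick the largest."""
    spectra = [dft(g) for g in codebook]
    similarity = [0.0] * len(codebook)
    for d in residue_periods:
        D = dft(d)
        for i, G in enumerate(spectra):
            similarity[i] += corr_coeff_freq(D, G)
    best = max(range(len(similarity)), key=lambda i: similarity[i])
    return best, similarity
```

Skipping bin 0 makes the measure insensitive to any constant offset, which matches the remark that the means play no role.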


i total  number  of  the  pitch  periods  in  this  frame.  The  codeword  i 


King  method  is 


that commonly used in CELP coders. The remainder of this section provides a brief discussion of the CELP algorithm and the autocorrelation method that achieves fast codeword searching.

4.1.4.2.1 CELP algorithm

The CELP algorithm was first proposed by Schroeder and Atal in 1984. Rapid development led to the standardization of this algorithm in the late 1980s. The CELP coder represents a breakthrough in speech coding, for it encodes speech signals at a rate as low as 4.8 kb/s but still produces satisfactory quality. The basic concept for the class of CELP coders can be viewed as a vector quantization technique, which passes a finite set of candidate vectors

specific  error  criterion.  However,  the  research  that  led  to  the  development  of  CELP  coders 

consists of a few pulses per frame regardless of whether the speech is voiced or unvoiced. The locations and amplitudes of these pulses are determined by minimizing a subjective error between the original and synthetic speech signals. The relationship between the CELP and MPLP coders can be understood by considering the multipulse excitation as a deterministic codebook consisting of innovation sequences (or codewords), each consisting of a single impulse with a different delay. Hence, searching for an optimum pulse location across the analysis frame is equivalent to searching through a set of ensembles.

In the primitive CELP coder (Figure 4-3), the speech signal, s(n), is analyzed in blocks of N samples. For each block, the synthetic speech signal is derived by passing every innovation sequence stored in a codebook through two recursive filters (long-term and short-term) with a proper scaling factor. An error signal is then formed by comparing the synthetic speech to the original one. Through an exhaustive search over the entire codebook, the innovation sequence (along with an appropriate scaling factor) that produces the minimum mean-squared subjective error is selected to reconstruct the synthetic excitation.

The short-term predictor in the CELP coder is the well-known LP filter. The long-term predictor is an extra stage used to enhance the periodicity of the synthetic speech by exploiting the similarity across consecutive pitch periods, and has been applied in open-loop and closed-loop form. In the former case the long-term predictor is derived directly from the residue obtained by inverse filtering the original speech, while in the latter case the optimal long-term predictor is computed by an analysis-by-synthesis procedure. Although the analysis-by-synthesis procedure does not provide much improvement in speech quality over the open-loop procedure, it spawned the concept of the "adaptive codebook" or "self-excited" model, in which each codebook entry is obtained by applying a moving window to the recent past excitation. More precisely, each codeword is a shifted version of the previous one with one new sample appended at the end. The conceptual structure of the adaptive codebook is illustrated in Figure 4-4. As seen in this figure, the function of the pitch predictor is replaced by the adaptive codebook. Owing to the dependency between neighboring codewords, together with a relaxed error criterion that provides an even weighting to the codewords, fast algorithms have been developed to reduce the inherently high computational complexity of the closed-loop procedure.

Following the formalism given by Trancoso and Atal (1990), we now use matrix and vector notation to illustrate the analysis-by-synthesis procedure for codeword searching. Given a codebook of $L$ sequences $c_k(n)$ ($k = 1, 2, \ldots, L$), each of length $N$, the filtering operation for an innovation sequence by the long- and short-term filters can

of these two filters. Written in matrix form, the filter output for the kth codeword can be represented by

$$\mathbf{y}^{(k)} = \gamma^{(k)} \mathbf{H} \mathbf{c}^{(k)} \tag{4-8}$$

Figure 4-4. Illustration of the adaptive codebook.

where $\gamma^{(k)}$ is the scaling factor for the kth codeword, $\mathbf{H}$ is an $N \times N$ matrix with the element in the mth row and the nth column given by the (m-n)th sample of the unit impulse response of the filter, and $\mathbf{c}^{(k)}$ is an N-dimensional vector with its nth component given by $c_k(n)$. Since $h_n = 0$ for $n < 0$, the matrix $\mathbf{H}$ can be written as

$$\mathbf{H} = \begin{bmatrix} h_0 & 0 & \cdots & 0 \\ h_1 & h_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h_{N-1} & h_{N-2} & \cdots & h_0 \end{bmatrix} \tag{4-9}$$

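Let us define $\mathbf{x}$ to be the desired signal with its nth component given by $x_n$, from which the memory contribution carried over from previous frames has been removed, since the filter memory plays no role in the search procedure. The total squared error $E^{(k)}$, representing the difference between the desired vector $\mathbf{x}$ and the vector $\mathbf{y}^{(k)}$, is defined as

$$E^{(k)} = \lVert \mathbf{x} - \gamma^{(k)} \mathbf{H}\mathbf{c}^{(k)} \rVert^2. \tag{4-10}$$

The $\gamma^{(k)}$ that minimizes $E^{(k)}$ is determined by setting $\partial E^{(k)} / \partial \gamma^{(k)} = 0$, yielding

$$\gamma^{(k)} = \frac{\mathbf{x}^T \mathbf{H}\mathbf{c}^{(k)}}{\lVert \mathbf{H}\mathbf{c}^{(k)} \rVert^2}. \tag{4-11}$$

Substituting Eq. (4-11) into Eq. (4-10) gives

$$E^{(k)} = \lVert \mathbf{x} \rVert^2 - \frac{\lvert \mathbf{x}^T \mathbf{H}\mathbf{c}^{(k)} \rvert^2}{\lVert \mathbf{H}\mathbf{c}^{(k)} \rVert^2}. \tag{4-12}$$

The best codeword is obtained by selecting, in an exhaustive search, the index k for which the error $E^{(k)}$ is minimum or, equivalently, for which the second term on the right-hand side of Eq. (4-12) is maximum.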
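The exhaustive search implied by the optimal scaling factor and the resulting error can be sketched directly. This is a minimal illustration without perceptual weighting or a long-term predictor; the function names are hypothetical, and the impulse response `h` is assumed to hold at least as many samples as a codeword.

```python
def filter_codeword(h, c):
    """y = H c, with H the lower-triangular Toeplitz matrix built from h."""
    N = len(c)
    return [sum(h[m - n] * c[n] for n in range(m + 1)) for m in range(N)]


def search_codebook(x, h, codebook):
    """Return (index, gain, error) of the codeword with the smallest error."""
    best = None
    target_energy = sum(v * v for v in x)
    for k, c in enumerate(codebook):
        y = filter_codeword(h, c)
        energy = sum(v * v for v in y)          # ||H c||^2
        if energy == 0.0:
            continue                            # an all-zero output cannot match
        cross = sum(a * b for a, b in zip(x, y))  # x^T H c
        gain = cross / energy                   # optimal scaling factor
        error = target_energy - cross * cross / energy
        if best is None or error < best[2]:
            best = (k, gain, error)
    return best
```

Because the first term of the error does not depend on the codeword, minimizing the error and maximizing the ratio of squared cross term to energy select the same index.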


In principle, the error derived above spans the entire spectrum of the synthetic speech. Due to auditory masking, the error in the high-energy regions is masked by the speech signal, suggesting that the error should be concentrated in the formant regions to reduce perceptual distortion. This can be accomplished by the use of a weighting filter W(z) that attenuates the frequencies where the error is perceptually less important and amplifies those frequencies where the error is perceptually more important:


where $0.6 < \xi_2 < \xi_1 \le 1$, and $a_k$ are the LP coefficients. If $\xi_1$ is set to unity, $\xi_2$ in the range $0.6 < \xi_2 < 0.9$ gives similar subjective results in informal listening tests (Rose and Barnwell, 1990).

Referring to Eq. (4-12), the computation required in the codeword searching contains only two terms, namely, a cross correlation term between the vectors $\mathbf{x}^T\mathbf{H}$ and $\mathbf{c}^{(k)}$ and an energy term corresponding to the filtered output $\mathbf{H}\mathbf{c}^{(k)}$ of each codeword. The energy term is computationally expensive if the matrix multiplication is performed directly. Fortunately, many methods have been proposed for avoiding the time-consuming matrix multiplication. We will now discuss the autocorrelation method, which is very efficient for fully populated codebooks and, therefore, an excellent choice for our stochastic codebook.

4.1.4.2.2 Autocorrelation method

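Let us first consider the energy term in the second part on the right side of Eq. (4-12). Recall that we have already dropped the long-term predictor from our model, so $h_n$ represents only the impulse response of the short-term predictor filter. We rewrite the energy term as

$$\lVert \mathbf{H}\mathbf{c}^{(k)} \rVert^2 = \sum_{n=0}^{N-1} \Bigl( \sum_{i=0}^{n} h_{n-i}\, c_k(i) \Bigr)^2. \tag{4-14}$$

Making use of the fact that the sum of the squares of the convolution of two sequences equals the cross correlation of the autocorrelations of these two sequences, Eq. (4-14) can be rewritten as

$$\lVert \mathbf{H}\mathbf{c}^{(k)} \rVert^2 = R_h(0) R_c^{(k)}(0) + 2 \sum_{i=1}^{N-1} R_h(i) R_c^{(k)}(i), \tag{4-15}$$

where

$$R_h(i) = \sum_{n=i}^{N-1} h_n h_{n-i}, \tag{4-16}$$

$$R_c^{(k)}(i) = \sum_{n=i}^{N-1} c_k(n)\, c_k(n-i). \tag{4-17}$$

For Eq. (4-15) to hold, the convolution cannot be truncated, which requires that the impulse response of the synthesis filter be effectively zero beyond the Nth sample. In most circumstances this requirement will be satisfied after imposing the spectral weighting filter. If we further define the cross-correlation between $h_n$ and $x_n$ by

$$R_{hx}(i) = \sum_{n=i}^{N-1} x_n h_{n-i}, \tag{4-18}$$

Eqs. (4-11) and (4-12) are transformed into

$$\gamma^{(k)} = \frac{\sum_{i=0}^{N-1} R_{hx}(i)\, c_k(i)}{R_h(0) R_c^{(k)}(0) + 2 \sum_{i=1}^{N-1} R_h(i) R_c^{(k)}(i)} \tag{4-19}$$

and

$$E^{(k)} = \lVert \mathbf{x} \rVert^2 - \frac{\bigl( \sum_{i=0}^{N-1} R_{hx}(i)\, c_k(i) \bigr)^2}{R_h(0) R_c^{(k)}(0) + 2 \sum_{i=1}^{N-1} R_h(i) R_c^{(k)}(i)}, \tag{4-20}$$

respectively.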

From the above derivation, it is easily seen that this method contributes substantial savings in computation time. The energy term can now be computed with just N multiplications per codeword. However, the price we have to pay is the storage of an additional codebook with the autocorrelation coefficients of the original codewords.
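The identity behind the autocorrelation method can be checked numerically with a short sketch. It compares the energy of the full (untruncated) convolution with the value obtained from the two autocorrelation sequences; the function names and the requirement that both sequences have the same length are assumptions for illustration.

```python
def autocorr(s):
    """Autocorrelation R(i) = sum_n s[n] * s[n - i] for lags 0 .. len(s) - 1."""
    return [sum(s[n] * s[n - lag] for n in range(lag, len(s)))
            for lag in range(len(s))]


def energy_direct(h, c):
    """Energy of the full, untruncated convolution of h and c."""
    total = 0.0
    for m in range(len(h) + len(c) - 1):
        y = sum(h[m - n] * c[n] for n in range(len(c)) if 0 <= m - n < len(h))
        total += y * y
    return total


def energy_autocorr(Rh, Rc):
    """R_h(0) R_c(0) + 2 * sum_{i >= 1} R_h(i) R_c(i)."""
    return Rh[0] * Rc[0] + 2.0 * sum(a * b for a, b in zip(Rh[1:], Rc[1:]))
```

In a codebook search the autocorrelation of the impulse response is computed once per frame and the codeword autocorrelations are read from the stored table, so only the final sum of N products remains per codeword.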


4.2  Synthesis  Scheme 

Speech synthesis is the procedure of reconstructing speech signals by controlling and updating the parameters of a speech production model estimated in speech analysis. The synthesis of unvoiced speech is straightforward and can be accomplished by exciting the time-varying all-pole filter with the gain-adjusted innovation sequence sequentially. On the other hand, the synthesis of voiced speech is rather complicated because we have to construct the synthetic excitation from the gross features of glottal phases. Therefore, most of this section is focused on the synthesis schemes for voiced speech.

Although many control parameters for voiced speech were estimated on a frame-by-frame basis, the corresponding synthesis can still be carried out pitch synchronously provided that the control parameters are properly interpolated for each pitch


' the  interpolation  will 


period.  In  this  section,  we  start  from  the  discussion  of 
glottal  phase  and  LP  coefficients.  Then,  we  present  a method  for  eliminating  the  spectral 
tilt  of  the  glottal  pulse.  Effects  of  vocal  noise  and  source-tract  interaction  are  discussed 


4.2.1 Interpolation of Glottal Phase

As mentioned earlier, only one codeword is employed to indicate the glottal phase characteristics for each frame. Although large discrepancies may occur between any two adjacent pitch periods in one frame, a progressive alteration of the glottal pulse still sounds reasonable because such discrepancies are already reflected in the codewords of different frames. This, however, results in discontinuities of the glottal phase characteristics at frame boundaries. We therefore apply a lowpass filter to eliminate the rapid changes of the polynomial coefficients as follows:

(1)  DR  filter: 


the polynomials for the previous, current and next frames, respectively. In our program, an IIR filter with the value of a = 0.5 is used since it works well in our experiments. We recall that the resulting polynomial has to satisfy the three constraints specified in Section 3.2.1. The requirement that the coefficients on the right-hand sides of Eqs. (4-21) and (4-22) sum to unity ensures compliance with the first constraint. However, no extra consideration is necessary for the other two constraints.


, a complete  procedure  for  generating  a glottal  impulse  is  given. 


P(k)  = (l-a)P  (*)  + aPJfc). 


(4-21) 


(2)  HR  filter 


where $P_i(k)$ is the polynomial for the ith pitch period,


and $P_{n-1}(k)$, $P_n(k)$ and $P_{n+1}(k)$ are


As in the case of the glottal phase characteristics, the LP coefficients extracted by the frame-based method may exhibit unreasonable discontinuities at frame boundaries. A simple solution would be to linearly interpolate the LP coefficients. However, synthetic speech produced by this method may sound too smooth for speech segments with a rapid spectral transition. The plosives are typical examples that suffer from this drawback of linear interpolation. This suggests that the interpolation of the LP coefficients should be "piece-wise continuous." Thus, we adopt a quadratic weighting function $w_i$ to interpolate the LP coefficients:

$$w_i = \frac{\dfrac{1}{\bigl(\lvert n_s - l_f/2 - i\,l_f\rvert + \lvert n_e - l_f/2 - i\,l_f\rvert\bigr)^2}}{\displaystyle\sum_{j=-1}^{1} \dfrac{1}{\bigl(\lvert n_s - l_f/2 - j\,l_f\rvert + \lvert n_e - l_f/2 - j\,l_f\rvert\bigr)^2}}, \quad i = -1, 0, 1 \tag{4-23}$$

where $\lvert\cdot\rvert$ denotes the absolute value, $n_s$ and $n_e$ denote the positions of the beginning and ending points of the current pitch period, and $l_f$ is the number of samples in each frame. The vector of the interpolated LP coefficients, $\mathbf{A}_{new}$, is obtained by

$$\mathbf{A}_{new} = \mathbf{A}_{-1} w_{-1} + \mathbf{A}_0 w_0 + \mathbf{A}_1 w_1 \tag{4-24}$$

where $\mathbf{A}_{-1}$, $\mathbf{A}_0$ and $\mathbf{A}_1$ are the LP coefficient vectors of the previous, current and next frames, respectively. Figure 4-5 illustrates the linear and quadratic interpolations for an arbitrary coefficient. What we mean by "piece-wise continuous" is clearly delineated by the quadratically interpolated curve.

One of the disadvantages of such an LP interpolation is that it occasionally moves poles outside the unit circle, implying that we have to reflect these outside poles into the unit circle in order to stabilize the synthesis filter. However, we do not consider this to be a serious problem since the interpolation can also be done using reflection coefficients, autocorrelation functions, or cross-sectional areas, for all of which the stability criterion is


Figure 4-5. Interpolation with respect to one of the LP coefficients along several frames (dotted line: without interpolation; solid line: with linear interpolation; dashed line: with proposed quadratic interpolation).


satisfied.  Moreover,  the  proposed  LP  interpolatic 


viththe 


filter. Apparently, the deadlock has to be resolved by inventing a transformation that is capable of performing interpolation with a different dimension. Unfortunately, we do not have an appropriate method for solving this problem. For this reason, a fixed-order filter is still used in this study.

4.2.3 Spectral Flatness


In order to meet the spectral specification of the residue, any source model for the LP synthesizer should have a flat spectrum. Our method for achieving the spectral flatness of the glottal excitation is inspired by the appearance of the integrated residue, in which the pulse swings around the glottal closure contribute most of the high frequency energy. Our formulation is given as follows:

First, we modify the third sample, g(3), of the modeled polynomial waveform of the integrated residue so as to ensure the existence of a sharp pulse:

$$g(3) = \bigl(1 + \max\{\, g(n) \mid n = 1, 2, \ldots, \lceil l/4 \rceil \,\}\bigr) \times 0.92 - 1 \tag{4-25}$$


the pitch period of l samples.


(4-26) 


(4-27) 


(using the new value of g(3)) so that energy at the middle frequencies is enhanced. The excitation pulse is obtained by differentiating g(n). Finally, a first-order inverse filter is applied to remove the spectral tilt of the excitation pulse. The resultant excitation is denoted as the glottal impulse, which will be frequently seen in the rest of this dissertation.


4.2.4  Effect  of  Vocal  Noise 


Vocal noise is important for synthesizing breathy and female voices (Klatt, 1987; Pinto et al., 1989; Klatt and Klatt, 1990; Childers and Lee, 1991). In Chapter 2, we showed that the extracted noise exhibits the following two features. First, the noise

magnitude of such noise near the glottal closure is higher than that at the other places.

As we also pointed out, part of the vocal noise possibly results from phase misalignment. In order to verify this possibility, we run a simulation using a noise source that has a larger amplitude around the glottal closure. The noise is produced by modulating uniformly distributed white noise with a Gaussian window given by

»■,<■>-  L™  <“*> 

where l is the pitch period and $\lfloor\cdot\rfloor$ denotes the floor function. By referring to the measurements in Chapter 2, we choose $\sigma$ as 0.25 and B as 0.5 to approximate the amplitude modulation of the vocal noise for normal male subjects. While adding this noise to the synthetic excitation, the amplitude of the vocal noise is adjusted to achieve a signal-to-noise ratio (SNR) of 25 dB.
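A hedged sketch of amplitude-modulated noise scaled to a target SNR is given below. Eq. (4-28) is not legible in this copy, so the window shape (a Gaussian bump at the glottal closures on top of a constant floor) and the way the values 0.25 and 0.5 are mapped onto the parameters `sigma` and `floor` are purely illustrative assumptions; only the SNR scaling follows directly from the text.

```python
import math
import random


def gaussian_window(l, sigma=0.25, floor=0.5):
    """Hypothetical window that peaks at the closures (n = 0 and n = l)."""
    w = []
    for n in range(l):
        d = min(n, l - n) / l  # normalized distance to the nearest closure
        w.append(floor + (1.0 - floor) * math.exp(-0.5 * (d / sigma) ** 2))
    return w


def modulated_noise(l, rng, sigma=0.25, floor=0.5):
    """Uniformly distributed white noise shaped by the window."""
    w = gaussian_window(l, sigma, floor)
    return [w[n] * rng.uniform(-1.0, 1.0) for n in range(l)]


def scale_noise_to_snr(signal, noise, snr_db):
    """Scale the noise so that 10 * log10(P_signal / P_noise) equals snr_db."""
    ps = sum(v * v for v in signal) / len(signal)
    pn = sum(v * v for v in noise) / len(noise)
    g = math.sqrt(ps / (pn * 10.0 ** (snr_db / 10.0)))
    return [g * v for v in noise]
```

The scaled noise would then be added sample by sample to the synthetic excitation of the corresponding pitch period.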

For the sake of comparison, we also test another type of vocal noise with a constant modulation, the level of which is measured midway between two glottal closure instants. Furthermore, we adopt a 60% duty cycle starting at the maximum glottal closure since it was preferred in a listening evaluation (Childers and Lee, 1991). The SNR is set to 28 dB to meet the measured level.

4.2.5 Source-Tract Interaction

high-quality, natural-sounding speech (Yea et al., 1983; Allen and Strong, 1985). In order to develop a comprehensive source model for speech synthesis, this particular effect cannot

a vocal system to control the glottal impedance or by incorporating an equivalent effect into a source model. Our approach falls in the latter category.

Two major effects in the glottal flow, namely, skewness and formant ripples, result from the source-tract interaction (Rothenberg, 1981; Fant and Ananthapadmanabha, 1982). The skewness, in general, varies at a relatively slow rate. Our glottal impulse model is expected to imitate the skewness with adequate precision. The formant ripples, in contrast, are too subtle to be depicted by this model. Due to the lack of accurate estimation, we only present methods for imitating the formant ripples rather than modeling them directly.

Since the ripple effect is associated with an increase in the formant bandwidth during the glottal open phase, similar results can be achieved by moving the poles of the LP filter inward or outward as we did for the spectral weighting filter in Section 4.1.4.2. The damping of an all-pole filter can be controlled by multiplying the corresponding coefficients, $a_k$'s, by the powers of a factor $\alpha$, i.e., $a'_k = a_k \alpha^k$ (Viswanathan and Makhoul, 1975; Tohkura et al., 1978). A value of $\alpha$ smaller than 1 will move the poles toward the origin and broaden the bandwidths of the poles. However, the opposite statement is not necessarily true when $\alpha$ is greater than 1. This is because the bandwidths are reduced only if the poles are moved closer

One possible implementation for the ripple effect is to use two sets of LP coefficients to simulate the damping factors for two different glottal phases. We apply the normal LP coefficients during the glottal closed phase and switch to the LP coefficients with a larger damping (i.e., $\alpha < 1$) when the vocal folds are open. As illustrated in Figure 4-6(b), the damping of the formant energy in the synthesized vowel /i/ is quite evident.
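The effect of scaling the coefficients by powers of a factor can be verified on a second-order example. The sketch assumes the all-pole convention 1 / (1 - a_1 z^-1 - a_2 z^-2); the function names and the test values are illustrative only.

```python
import cmath
import math


def scale_lp_coefficients(a, alpha):
    """Return a_k * alpha**k for a = [a_1, ..., a_p]."""
    return [ak * alpha ** k for k, ak in enumerate(a, start=1)]


def second_order_poles(a):
    """Poles of 1 / (1 - a1 z^-1 - a2 z^-2), i.e. roots of z^2 - a1 z - a2."""
    a1, a2 = a
    root = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + root) / 2.0, (a1 - root) / 2.0


# A resonance with pole radius 0.95 at angle pi/4 (assumed example values).
radius, angle = 0.95, math.pi / 4
a = [2.0 * radius * math.cos(angle), -radius * radius]
damped_poles = second_order_poles(scale_lp_coefficients(a, 0.9))
```

Each pole keeps its angle while its radius is multiplied by the factor, so a factor below 1 pulls the poles toward the origin and widens the resonance bandwidth.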

In case the true bandwidths are narrower than those estimated by LP analysis (Wong, 1980), we instead decrease the damping of the closed phase to compensate for the reduced bandwidths. Of course, this can be accomplished by using a filter with $\alpha > 1$, but the implementation of such a filter requires a priori knowledge about the locations of the poles. This calls for a root-solving routine, which is computationally expensive and very sensitive to quantization errors. Thus, an alternative solution that we adopt in this research is to

synthesized speech has a similar ripple effect. The filter is given by


where the values for $\alpha$ and $\beta$ are 0.8 and 0.7, respectively.

Although W(z) takes effect only during the glottal closed phase, the filter memory carried over from the previous frame has to be taken into account. To reflect the contribution of filter memory, we impose a virtual constraint that the glottal impulses for any two consecutive periods are the same. Thus, the filter memory can always be derived from the present glottal impulse instead of referring to the filtering results of the previous frame. By taking advantage of the facts that the duration of the glottal closed phase is usually less than one-half of the pitch period for modal voices and that the memory depth is limited by the filter order, we can perform the memory recovery process together with the ripple effect by filtering a circularly shifted glottal impulse with the glottal closure in the center (see Figure 4-7(5)). In this manner, the filter regains its memory during the glottal open phase and distributes the memory influence to the excitation approximately during the glottal closed phase.

To smooth the above process, we apply a Hanning window to this circularly shifted glottal impulse. The windowed component is fed into the filter W(z) to yield the intended damping. On the other hand, the remaining part (obtained by subtracting the windowed component from the excitation pulse) is kept unchanged throughout the course of the filtering operation. Because W(z) is an IIR filter, we append zeros to the windowed component to form a length of one and a half pitch periods such that the filter is able to release its energy

W(z)  = 


(4-29) 


sufficiently.


additional half period to that in the first half period. The glottal impulse with the source-tract interaction feature is formed by summing the filtered and remaining components. Figure 4-6(c) shows the synthesized data produced by using such a glottal impulse. In this figure, the damping of the presented speech signal is clearly reduced in the glottal closed phase.


To generate a glottal impulse, the recommended order for the implementation of the spectral flatness, vocal noise and source-tract interaction is given as follows:

2.  modify g(3), g(4) and g(6) by using Eqs. (4-25), (4-26) and
(4-27), respectively.

3.  add  white  noise. 

4.  differentiate  the  waveform  to  yield  an  excitation  pulse. 

6.  remove  the  spectral  tilt  of  the  excitation  pulse  by  inverse  filtering. 

Figure  4-7  illustrates  the  procedures  above. 


along  an  utterance.  Although  the  gain  is  an  important  factor  affecting  synthesis  quality. 


methods  for  calculating  the  gain  and  their  influences  on  synthetic  speech. 


For a source-filter speech production model (Fant, 1960), the filter output can be decomposed into two components: one results from the input, u(n), and the other


Figure 4-7. Procedures of generating a glottal impulse.


results from the filter memory, q(n). According to such a structure, a superposition method is often adopted for speech synthesis in order to avoid transient discontinuities at the boundaries of pitch periods (Verhelst and Nilens, 1986). That is, for each pitch period two synthesis filters are employed: the one holding the previous LP coefficients is in charge of the memory contribution, and the other possessing the new LP coefficients is responsible

Suppose we insist that the power of the filter output has to equal that of the original signal. Given a speech segment s(n) of M samples with power $P_s$:


Atal  and  Hanauer  (1971)  derived  the  gain,  Ag,  by  solving  the  following  equation  directly: 


In case Ag was negative or complex, they set Ag to zero. Such a zero setting is needed when the power contributed by the filter memory is too large. It appears that the zero setting is just a strategy to let the filter memory die out so that the gain can resume its function. Although this zero setting seems to be the only way to make the synthesis implementable, it definitely destroys the pitch harmonics of the synthetic speech.

Tohkura et al. (1978) suggested that the memory contribution was negligible when the filter response was sufficiently damped. Thus, after increasing the damping factor of the synthesis filter, they computed the gain without considering the filter memory, i.e.,


As a consequence, the elimination of the zero setting comes at the price of possible errors due to neglecting the filter memory.


(4-30) 


(4-31) 
Makhoul (1975), on the other hand, computed the gain on the level of the driving function. By assuming that the excitation was either an impulse train for voiced speech or white noise for unvoiced speech, both of which have unity power, he computed the gain by estimating the power of the residue:


where R(k) is the autocorrelation function of the analyzed speech signal, and the $a_k$'s are the LP coefficients. Because the gain is a by-product of the LP analysis, this method seems very elegant and straightforward. However, it leads to the following problem. Unlike the impulse, the residue signal does not have an absolutely flat spectrum. Small mismatches between the impulse and the residue at low frequencies may be amplified after imposing the synthesis filter. Oftentimes, the resultant errors are manifested as energy fluctuations in the synthetic speech, and a warble-like quality will be perceived.

In CELP and MPLP coders, the part attributed to the filter memory is first removed from the analyzed speech signal (Trancoso and Atal, 1990; Rose and Barnwell, 1990). The gain is determined by the cross correlation between the spectrally weighted speech signal

automatically  compensate  for  the  resultant  error  when  the  synthesized  speech  signal  does  not 
match  the  original  very  well.  Unfortunately,  such  an  approach  does  not  suit  our  model 
because  the  best  fit  of  the  glottal  impulse  may  still  result  in  a large  discrepancy  between  the 
original  and  synthetic  speech  signals. 

From  the  above  discussions,  we  see  that  the  gain  is  used  to  regulate  the  amplitude 

pertaining to the amplitude of the speech signal. To avoid the above-mentioned drawbacks,
we  demonstrate  a method  below  to  retrieve  the  gain  from  P„ 


It is noted that the speech waveforms in many adjacent pitch periods are very similar, suggesting that the initial and final filter memory are nearly equal in most cases. Thus, the filter memory can be approximated via the filtering operation with the use of the same excitation. If the number of pitch periods, m, of the underlying excitation is large enough (say m=5), the filter memory contributed before the first period is negligible. Therefore, the zero-startup filter response in the last period can be regarded as a complete filter output. Two examples of such a filtering operation are illustrated in Figure 4-8. The gain Ag for the excitation pulse is then calculated by


where s_m(k) is the resultant filter output within the mth pitch period, and l is the length of the pitch period.
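The multi-period filtering idea can be illustrated with a short sketch. Because the original gain formula is not reproduced here, the sketch assumes that the gain is chosen so that the power of the filter output in the last (mth) period matches a given target power; that matching rule, the all-pole convention y(n) = x(n) + Σ α_k y(n-k), and the names `allpole_filter` and `gain_from_periodic_response` are all assumptions made for illustration.

```python
import math

def allpole_filter(excitation, alpha):
    """Direct-form all-pole filter started from zero memory.

    y(n) = x(n) + sum_k alpha[k-1] * y(n-k)
    """
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(alpha, start=1):
            if n - k >= 0:
                acc += a * y[n - k]
        y.append(acc)
    return y

def gain_from_periodic_response(pulse, alpha, target_power, m=5):
    """Gain for one excitation pulse, measured in the last of m periods.

    pulse: one pitch period of unit-scaled excitation (length l).
    The pulse is repeated m times so that the memory contributed before
    the first period has decayed by the time the last period is reached.
    """
    l = len(pulse)
    y = allpole_filter(list(pulse) * m, alpha)
    last = y[-l:]                               # output in the mth period
    power = sum(v * v for v in last) / l
    return math.sqrt(target_power / power)
```

With a well-damped filter, the gain computed with m=5 is already very close to the one computed with a much longer run of periods, which is the approximation the text relies on.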

The derivation described above needs a large number of filtering operations. The algorithm presented in the following alleviates the computational burden. Notice that the filter memory from the past frame is always accessible during the speech synthesis. We can simulate the foregoing filtering process by referring to the actual filter memory. Suppose the filter is implemented using a direct-I form structure. The filter memory is, therefore, represented by the ending samples of the previous frame. As seen more clearly in Figure 4-8(b), it is the deviation of the filter memory that retains the similarity across all the periods. Based upon this observation, we separate the filter memory, s(-k), into two parts: the mean value, s_m(-k), and the deviation, s_d(-k):

$$ s_m(-k) = \frac{1}{p} \sum_{j=1}^{p} s(-j), \qquad k = 1, 2, \ldots, p; \tag{4-35} $$

$$ s_d(-k) = s(-k) - s_m(-k), \qquad k = 1, 2, \ldots, p. \tag{4-36} $$
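As a small illustration of this decomposition, the sketch below splits a list of p filter-memory samples into their mean and the per-sample deviations. It assumes the mean is taken over the p memory samples; the function name `split_memory` is hypothetical.

```python
def split_memory(memory):
    """Split filter memory s(-1..-p) into its mean and deviations.

    memory: list of the p most recent output samples.
    Returns (mean, deviation) where deviation[k] = memory[k] - mean.
    """
    p = len(memory)
    mean = sum(memory) / p
    deviation = [s - mean for s in memory]
    return mean, deviation
```

By construction the deviations sum to zero, so any rescaling of the deviation part leaves the mean of the memory unchanged.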


Figure 4-8. Zero-startup responses for I


We simulate the filtering process by means of an iterative procedure, which is governed by a constant variance of s_d(-k). That is, in each iteration we calibrate the amplitude of the filter memory according to the mean value and deviation of the previous results by

s%)  = «<»>  + +/>)- t/("»j  - « 

(n  - 1.2 0 (4-37) 

where u(n) is the filter response of the current excitation, U(n) is the step function, and the

p are  adjusted  by 
We start the iteration from the zero-startup response using the selected glottal impulse. After this iterative procedure is repeated several times, the resultant signal approaches the one derived by filtering consecutive glottal impulses. The gain Ag is calculated as

As  = I ,P'  ■ 

* / ' (4-40) 
The obtained Ag's sometimes exhibit large variations among adjacent pitch periods. This may cause perceivable energy fluctuation in the synthetic speech. Unfortunately, a smoothing procedure based on the Ag's cannot remedy such a defect because it does not take the filter gain into account. We solve the problem by introducing a new variable, ρ, which is


(4-38) 


(4-39) 


defined  as  the  square  root  of  the  proportion  of  the  power  emerging  from  the  current 


Then it is reasonable for us to argue that the obtained ρ's vary slowly during voiced speech. We therefore apply a first-order IIR lowpass filter, 0.3/(1 - 0.7z^-1), to the ρ's to prevent

determined  by 


Eventually, with the use of the proposed algorithm, the time-consuming convolution necessitated by the filtering operation is replaced by multiplications and additions. More importantly, this algorithm avoids the drawbacks that occur in the other methods.
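The first-order lowpass smoothing mentioned above, with transfer function 0.3/(1 - 0.7z^-1), corresponds to the recursion y(n) = 0.3 x(n) + 0.7 y(n-1). A minimal sketch follows; initializing the filter state with the first input value (to avoid a start-up transient) is an assumption of this sketch, and the function name `smooth` is hypothetical.

```python
def smooth(values, b0=0.3, a1=0.7, initial=None):
    """First-order IIR lowpass: y(n) = b0 * x(n) + a1 * y(n-1).

    initial: starting filter state; defaults to the first input value
    (an assumption made here so a constant input stays constant).
    """
    if not values:
        return []
    y_prev = values[0] if initial is None else initial
    out = []
    for x in values:
        y_prev = b0 * x + a1 * y_prev
        out.append(y_prev)
    return out
```

Since b0 + a1 = 1, the filter has unity gain at zero frequency, so slowly varying values pass through essentially unchanged while abrupt period-to-period jumps are attenuated.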


As mentioned in Section 4.1.4.2, we have adopted the CELP algorithm to reconstruct the unvoiced excitation. The gain Au is the scale factor γ corresponding to the optimum codeword. While the optimal γ provides the minimum error, it often lowers the power intensity of synthetic speech. Hence, as recommended by Zinser and Koch (1989), the gain Au is better replaced by a power match between the input signal x(n) (the signal with the memory contribution removed) and the filtered response of the synthetic excitation y(n), i.e.,


(4-41) 


(4-43) 
is the length of the subframe. In our case.
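The difference between the error-minimizing scale factor and a power-matched gain can be sketched as follows. The sketch assumes the power match takes the form sqrt(Σ x²(n) / Σ y²(n)) over one subframe; the function names `optimal_gain` and `power_match_gain` are hypothetical.

```python
import math

def optimal_gain(x, y):
    """Least-squares scale factor minimizing sum (x(n) - g * y(n))^2."""
    return sum(a * b for a, b in zip(x, y)) / sum(b * b for b in y)

def power_match_gain(x, y):
    """Gain that makes g * y(n) carry the same energy as x(n)."""
    ex = sum(v * v for v in x)
    ey = sum(v * v for v in y)
    if ey == 0.0:
        return 0.0
    return math.sqrt(ex / ey)
```

By the Cauchy-Schwarz inequality the magnitude of the least-squares factor never exceeds the power-matched gain, which is consistent with the observation that the optimal scale factor tends to lower the power of the synthetic speech.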


As in many LP synthesizers that use a multi-mode excitation source, an inappropriate voicing decision may lead to the deterioration of synthetic quality. The problem becomes serious during the voiced/unvoiced transition since the pitch estimation is prone to error in this region. The deficiency related to the pitch estimation can be alleviated by applying a median filter or error correction method to the pitch contour so that pitch halving, doubling, and other deviating results can be avoided. To ameliorate the problem of a strict voicing decision, we propose a method to smooth the voicing transition.
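A median filter over the pitch contour, as mentioned above, can be sketched in a few lines. The window width of three frames and the edge handling (repeating the first and last values) are assumptions of this sketch, and the function name `median_smooth` is hypothetical.

```python
def median_smooth(contour, width=3):
    """Running median of a pitch contour (width must be odd).

    The contour is padded by repeating its end values so that the output
    has the same length as the input.
    """
    if not contour:
        return []
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    padded = [contour[0]] * half + list(contour) + [contour[-1]] * half
    out = []
    for i in range(len(contour)):
        window = sorted(padded[i:i + width])
        out.append(window[half])        # middle element is the median
    return out
```

For example, isolated halving and doubling errors in a contour around 100 are replaced by neighboring values, while the overall trend of the contour is preserved.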

Consider a voiced segment, sv(i), that is next to an unvoiced segment. If the voiced segment is ahead of an unvoiced segment, then we can gradually change the voicing model by

s(i) = ...,    i = 1, 2, 3, ..., Nf    (4-44)

where s(i) is the resultant speech signal, Nf is the frame length, and sv(u)(i) is an alternative version of sv(i) synthesized by using the innovation sequences. If the voiced segment is located behind an unvoiced segment, s(i) becomes

s(i) = ...,    i = 1, 2, 3, ..., Nf    (4-45)

For simplicity, sv(u)(i) is derived at the analysis stage in our program. Nonetheless, it can also be done at the synthesis stage, where the CELP analysis-by-synthesis procedure is brought in to resynthesize the sv(i) produced using the glottal impulse.
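The gradual change between the two versions of the voiced segment can be illustrated with a cross-fade. The original weighting functions are not reproduced here, so the linear weights i/Nf used below are purely an assumption for illustration, as is the function name `blend_transition`.

```python
def blend_transition(s_v, s_vu, voiced_first=True):
    """Cross-fade between two syntheses of the same voiced frame.

    s_v:  frame synthesized with the glottal impulse.
    s_vu: the same frame synthesized with the innovation sequences.
    voiced_first=True  -> voiced frame precedes an unvoiced one, so the
                          output moves from s_v toward s_vu.
    voiced_first=False -> voiced frame follows an unvoiced one, so the
                          output moves from s_vu toward s_v.
    Linear weights are an assumption of this sketch.
    """
    n_f = len(s_v)
    out = []
    for i in range(1, n_f + 1):
        w = i / n_f                      # weight of the innovation version
        if not voiced_first:
            w = 1.0 - w
        out.append((1.0 - w) * s_v[i - 1] + w * s_vu[i - 1])
    return out
```

In both directions the frame boundary adjacent to the unvoiced segment is dominated by the innovation-based version, so the change of excitation model is spread over one frame instead of occurring abruptly.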

4.4 Subjective Quality Evaluation

Like many other waveform coders, the proposed source model aims to extract important features that are not modeled by the LP filter. However, it is important to point out that although our synthetic speech waveform is very close to the original, we do not apply any closed-loop waveform-matching criterion or spectral weighting function while synthesizing voiced speech. An objective measure based on the segmental SNR therefore does not appear appropriate for indicating the quality of the synthetic speech. For this reason, we conducted informal listening tests to assess the performance of the proposed source model as well as the LP speech synthesizer.

In addition to the training sentences, two other sentences were tested, namely, "That zany van is azure" and "Should we chase those cowboys." The speech tokens included those uttered by speakers not in the training group. It was found that the quality of the synthetic speech was very close to that of the original speech. When the recorded speech was played back through loudspeakers in an A-B test, listeners found it difficult to discriminate the synthetic speech from its original counterpart. For speech tokens in which the pitch contours were identical, the probability that the synthetic speech was preferred over the original speech was approximately one-third.

To  acquire  a more  critical  view  of  the  excitation  model,  the  listening  tests  were  also 
carried  out  using  a high-quality  headphone  (Sony,  MDR-V6).  It  was  revealed  that  our 

and  the  inverse  filter  could  not  fully  counteract  such  a tendency.  As  a result,  the  synthetic 
speech was judged to be slightly bassy. However, when we increased the order of the inverse filter, the synthesized quality became crisper. Because such a crisp quality was not always preferred by listeners, we did not consider increasing the order an acceptable remedy.


In contrast to the limited achievement of the inverse filter, both noise and source-tract interaction were more likely to be responsible for the improvement of synthetic quality. The addition of noise, in general, reduced the metallic attribute of synthetic speech. However, the use of a different amplitude modulation did not greatly affect the speech quality. This is probably because the noise power for the modal voices is too low to result in any significant difference. The quality improvement due to the incorporation of source-tract interaction was noticeable in our experiments. We attribute this to the combined result of raising the formant resonances and attenuating the inter-formant components of the glottal impulse, which contributes to the dispersiveness of the formant ripples and, in turn, reflects the fact that the residue is somewhat intelligible. From the viewpoint of spectral shaping, the resultant effect of formant ripples is considered the same as the amplitude spectrum modification introduced by Kang and Everett (1985) and the adaptive postfilter suggested by Chen and Gersho (1987). For this reason, a backward filtering operation that disperses the formant ripples ahead of the glottal closure is also recommended. In other words, the W(z) that we used to modify the excitation pulse can be a zero-phase filter.

From  the  listening  test,  it  was  also  found  that  modification  with  respect  to  g(3),  g(4) 
and  g(6)  varied  the  pattern  of  vocal  fold  closure.  According  to  our  experience,  a different 
closure  pattern  might  lead  to  a change  of  the  perceived  quality.  Although  our  empirical 
formula  for  constructing  the  excitation  pulse  suits  a variety  of  voices,  at  present  we  do  not 

Buzziness was reported in some synthetic speech of female speakers, especially for females with high fundamental frequencies of voicing. Using the visual comparison presented in Figure 4-9, we observe that the synthetic speech waveform for female voices, in general, has a direct bearing on the synthetic excitation, which shows a more rapidly rising slope at the GCI and fewer noisy components during the glottal open phase. This leads us to suspect that the vocal fold closure pattern that is suitable for male voices may be too strong for some female voices. Likewise, the level of vocal noise for males may not be appropriate for



Figure  4-9.  Comparison  of  the  excitation  and  speech  signals  for  a segment  of  voiced 
speech  uttered  by  a female:  (a)  ideal  excitation  (residue),  (b)  synthetic 
excitation  (glottal  impulses),  (c)  original  speech,  (d)  synthetic  speech. 


females.  Essentially,  an  ideal  glottal  impulse  should  possess  certain  flexibilities  in  mixing 
the  periodic  pulses  and  vocal  noise  while  preserving  the  necessary  peakiness,  the  spectral 
tilt  of  the  harmonic  spectrum  and  the  intensity  of  the  fundamental  component. 

Roughness was also occasionally perceived as a degradation of female synthetic speech. Since pitch irregularity has been considered to be an important correlate of roughness, the listening test results indicate the imperfection of our GCI identification algorithm, which relied on a sharp negative peak for proper initialization and on consistent similarities of adjacent pitch periods to capture the remaining GCI's. When this peak was smeared by nonstationary turbulence, the incorrect GCI's resulted in a domino effect at the later stages that even our pitch smoothing procedure could not fully counteract.

Other perceivable distortions occurred in segments containing fricatives and nasal consonants. This implies that our excitation model can only partially replicate the spectral zeros (anti-formants). Because the observed phase characteristics of the nasals are not significantly different from those of the vowels, by inference, nasal sounds are not necessarily required in the codebook training. This inference has been further confirmed by testing the glottal codebook trained without nasals. No significant degradation was found for speech synthesized using such a codebook.


CHAPTER 5

CONCLUDING REMARKS

5.1 Summary

We confronted several problems in the first phase of this research. Attempts were made to verify the relationship between the residue signal and the glottal flow waveform. We concluded that the vocal characteristics could be retrieved from the integrated residue, which resembled the differentiated glottal flow. Also, within the source-filter theory, we proposed a comprehensive speech model that would encompass acoustic features previously used as quality attributes. The role of each model parameter was interpreted in the context of the acoustic measure. Thus, by making use of the LP analysis with the aid of EGG signals, we proposed methods for isolating and extracting the acoustic features. In particular, the perturbations of the vocal source were decomposed into low-frequency drifts and wideband noise, where the latter was extracted by using a DFT method and later applied to derive the %jitter and %shimmer defined by Eqs. (2-15) and (2-16). The glottal spectral tilt was estimated using LP analysis of speech signals. While the removal of the spectral characteristics was performed by inverse filtering, the glottal phase was described by the waveshape of the integrated residue and a novel measure called the abruptness index. The vocal noise was extracted from the integrated residue using a time domain approach and examined in three aspects, i.e., the relative power level, the amplitude modulation, and the noise

sustained vowels /i/ of three voice types (modal, vocal fry, and breathy voices) as examples. The outcomes of these acoustic measures were carefully investigated, and we have reached the following conclusions:
(1) As listed in Table 2-3, the distributions of %jitter and %shimmer for the three voice types generally agreed with other researchers' results. More importantly, these results substantiated our assumption that the perturbation noise exhibited a Gaussian distribution in which the standard deviation was sufficient to characterize the statistical property. If we consider the quality in a broader perspective, the gross pitch and intensity variations of speech signals transmit not only linguistic messages but also non-linguistic information such as intonation, emotional stress, and speaker idiosyncrasy. In order to synthesize natural speech, an accurate and faithful replication of these variations is necessary.

(2) The turbulent noise in breathy voices was perceptually distinctive and acoustically discriminable from that in the other two voice types. This underscores the need for a vocal noise model in the source model. The noise spectra for different phonations were fairly flat, and therefore white noise was suitable for modeling the vocal noise. On the other hand, although the amplitude modulations of the vocal noise generally resemble the magnitudes of the integrated residues, we showed that this resemblance could result from the phase misalignment.

(3)  The  estimation  of  the  glottal  spectral  tilt  using  LP  analysis  on  the  speech  signal 
was  tested  with  satisfactory  results.  In  addition  to  visual  inspection  of  the  magnitude  spectra 
of  modeled  filters,  a simple  comparison  can  also  be  carried  out  by  testing  the  first  coefficient 
of the underlying filter. The spectral tilts are moderate, relatively flat, and steep for modal, vocal fry and breathy voices, respectively.

(4)  The  glottal  phase  characteristics  did  not  show  any  significant  relation  across 
different  voice  types,  suggesting  no  general  rules  for  modeling  the  phase  characteristics  for 
different  voice  types.  The  abruptness  index,  in  contrast,  showed  great  potential  for 
discriminating  voice  types,  because  the  associated  measures  for  each  voice  type  are  highly 
self-clustered  and  well  separated  from  one  another. 

The  above  results  provide  a general  idea  of  glottal  variabilities.  More  extensive 
investigations  are  needed  to  establish  the  statistical  significances  between  model  parameters 


vocal  quality.  The  LP  analysis  appears  i 


properties  as  well  as  the  formant  patterns.  Thus,  it  is  reasonable  for  us  to  argue  that  a high 
quality  LP  synthesizer  is  achievable  if  the  acoustic  features  are  accurately  estimated  and 
faithfully  reproduced. 

In  the  second  phase  of  this  research,  we  were  interested  in  the  design  of  a high-quality 
natural-sounding LP synthesizer. In Chapter 3, we presented an excitation model to simulate
the  voiced  residue  by  the  glottal  impulses  and  the  unvoiced  residue  by  the  innovation 
sequences.  These  two  types  of  excitations  were  further  formulated  as  two  codebooks  geared 
to  an  all-pole  filter,  of  which  the  coefficients  are  estimated  using  the  orthogonal  covariance 
method.  Schemes  for  speech  analysis  and  synthesis  were  discussed  in  Chapter  4. 
Experiments  with  this  new  model  and  processing  schemes  demonstrated  the  competency  of 
producing  natural  sounding  speech.  In  addition  to  source  modeling,  we  believe  efforts  that 
lead  to  such  encouraging  results  include  the  methods  and  algorithms  performing  the  GCI 
identification,  codeword  searching,  piece-wise  LP  interpolation,  glottal  pulse  smoothing, 
spectral  adjustment,  source-tract  interaction  and  gain  determination.  These  are  either 
introduced  for  the  first  time  in  the  literature  or  have  had  some  modifications.  Our 
achievements  can  be  appreciated  by  appraising  the  quality  of  synthetic  speech. 


Though  our  LP  synthesizer  has  been  tested  with  fairly  high  success,  there  is  still  room 
for  extension.  Several  possible  improvements  are  suggested  as  follows. 

5.2.1  Extraction  of  Vocal  Noise 

Our noise extraction algorithm was impeded by the difficulty of phase misalignment. While the pitch delay is always restricted to integer multiples of the sampling (or resampling) interval, a possible method for overcoming this drawback is the use of a pitch predictor, for it not only provides the necessary interpolation but also maximizes the correlation between the analyzed signals. In general, the number of filter taps need not be large, and the associated coefficients can be easily obtained by minimizing the mean squared error between the two signals. However, since the noise must be measured at the level of source excitation, more studies should be made concerning the effect of the sequential order of the pitch predictor and the inverse filter. Furthermore, as we already pointed out in Section 2.7, there are two types of noise present in the residue signal, namely, the noise associated with the epoch variation and that associated with the airflow turbulence. One may consider how to decompose the residue signal into two such components, thus forcing the pitch predictor to examine the epoch or airflow variations separately. This will allow us to enlarge our view of the vocal


The  improvement  of  performance  and  reliability  of  the  GCI  identification  algorithm 
becomes  an  urgent  requirement  for  high  quality  speech  synthesis.  In  this  research,  we 
located  the  GCI's  by  first  choosing  the  largest  negative  peak  of  the  integrated  residue  in  a 
frame  as  a reference  mark  and  then  searching  for  the  other  peaks  by  a maximum  correlation 
approach.  The  resultant  synthetic  speech  suffers  probable  distortion  emerging  from  the 
inaccurate  pitch  identification.  Thus,  we  have  to  rely  on  some  correctional  procedures  to 
rectify  some  errant  pitch  transitions.  It  was  reported  that  the  GCI  identification  could 
achieve  good  performance  if  the  maximum  correlation  approach  was  directly  applied  to  the 
speech signal (Cheng and O'Shaughnessy, 1989). Although our experiments with Cheng's
approach  did  show  some  promising  results,  this  approach  has  to  be  further  refined  before  it 
can  function  automatically. 

5.2.3 Excitation Source

Meanwhile, we are concerned with the excitation function for normal voices, leaving much latitude for the modification of the glottal codebook. For utterances with elongated glottal open phases, the increased air turbulence will certainly perturb the accuracy of the inverse filter. Therefore, not only is the primary residue pulse less distinctive, but there are also other spurious components. It is obvious that this type of residue cannot be properly described by a sharp glottal impulse. Also, based upon our perceptual impression, we believe that the strong excitation pulses should at least be partially responsible for the buzzy characteristics of female synthetic speech. One may think about ameliorating such defects by raising the vocal noise. According to our experience, adding noise did increase the breathiness but could not actually soften the voices. We therefore have to resort to other means. Several efforts in the past have been directed to designing excitation signals with low peak factors and flat magnitude spectra (Schroeder, 1970; Rabiner and Crochiere, 1975). A logical follow-up of this research would be to consider designing a different codebook or developing processing schemes that control the peak factor as well as the vocal noise.


The ripple effect is known to be important for the improvement of speech quality, but it could only be simulated empirically in our experiments due to the lack of an efficient method for estimating the formant damping. What is the proper amount of formant ripple that should be incorporated into the speech synthesizer in order to produce natural-sounding speech? If we believe that the changes in formant damping only occur at the transitions from open-to-closed or closed-to-open glottis, then the utilization of techniques developed for fast formant tracking (Ting and Childers, 1990) and for the estimation of exponentially damped sinusoids (Parthasarathy and Tufts, 1987) may offer answers to this question. More studies are needed to decide how to control the damping factors, and how these variations affect the quality. Even though the estimation procedures may be computationally prohibitive for practical use, we would at least gain some qualitative description that might characterize


different  groups  of  speakers. 


the use of a multiple-tap pitch predictor that we discussed in Section 5.2.1. The other is the
fractional  interpolation  mentioned  in  Section  2.5.4.  We  believe  the  interpolated  values  of 
the  signal  will  be  able  to  represent  the  voiced  speech  more  accurately  and  achieve  an 
improvement  of  the  synthetic  quality  for  female  speakers. 


In the proposed LP speech synthesizer, the spectra of speech signals are represented by an all-pole model. This model, however, is not suited for nasals and consonants, for the spectral envelopes of such sounds exhibit dips (zeros) besides peaks (poles). Though our source excitation partially compensates for the absence of spectral zeros by exploiting the adaptive nature of codeword searching, an Autoregressive Moving Average (ARMA) model seems to be even more attractive and straightforward for improving the quality of synthetic speech when spectral zeros are perceptually important (Atal and Schroeder, 1978; Akamine and Kiseki, 1989).


5.3 Applications

There are at least four areas where the techniques developed in this research are applicable.


In this research we have postulated that a comprehensive speech production model should be constituted by a complete set of acoustic features. Thus, by utilizing the feature

perceptually objective quality (or distortion) measure in a full context by verifying the relationships between the psychoacoustic attributes and the model parameters. Two possible approaches that ... described objective measure are: (1) statistical analysis based on large amounts of data, and (2) the analysis-by-synthesis procedure.

In  addition  to  assessing  the  speech  quality  (or  severity),  the  resultant  distortion 


identification techniques can be advanced. Such a measure can also be used to study the way humans process acoustic signals so as to improve speech recognition techniques. Furthermore, for any laryngeal pathology that is acoustically perceptible, the distortion measure can provide insight toward the classification of the pathologies.

5.3.2 Speech Coding

We  have  not  spent  much  effort  dealing  with  the  process  of  quantization  and  coding, 
but  one  may  anticipate  the  superiority  of  the  proposed  model  in  speech  coding.  Some 
advantages  can  be  easily  figured  out  by  comparing  the  size  of  codebook  for  the  various 

by the DoD 4.8 kb/s voice coder (Campbell et al., 1989). In addition, only one codeword
is  reserved  to  characterize  the  residue  signal  for  each  frame.  This  will  greatly  reduce  the 


Because every codeword represents a different pattern of the glottal phase characteristics, the codeword searching can be considered as a process of monitoring and tracing the phase variation. This implies that the glottal codebook can also be used to study the phase properties. As an example, we have applied one single codeword to synthesize an entire sentence while keeping the other parameters unchanged. The synthesized speech was intelligible but not as natural as the one synthesized by using the selected codeword sequence. Evidently, the poor quality is caused by the fixed glottal phase. This fact also indicates another possible application of this source model that is infeasible with other waveform coders. That is, our model can be used to convert the glottal phase characteristics. In the past, the LP synthesizers were only used to perform the prosodic and spectral modifications in the voice conversion systems (Childers et al., 1989b; Savic and Nam, 1991; Valbret et al., 1992). The acoustic realization using our model will render a complete version for voice conversion.


A text-to-speech system can be viewed as a hierarchy that converts the text to a sequence of phonetic transcriptions before producing the acoustic output, governed by semantic, syntactic and lexical rules that monitor the various intermediate transformations (Klatt, 1987). The final phonetic transcription is represented by the formant patterns (or LP coefficients), duration, pitch and intensity contours, all of which correspond to the parameters of a speech production model. Thus, if the LP coefficient vectors for various phonetic segments were already registered in a codebook, we could easily apply our excitation codebooks to produce speech signals with the desired quality. In fact, many LP codebooks are nowadays used for speech coding and have demonstrated high performance. It seems that the incorporation of such an LP codebook into our current model is not only hypothetically feasible but also a simple way to integrate the work of speech synthesis.